Ingestion of metrics or logs very slow (and high CPU usage on collector) #33427

Open
rpasche opened this issue Jun 7, 2024 · 6 comments
Labels: cmd/telemetrygen

rpasche commented Jun 7, 2024

Component(s)

cmd/telemetrygen

Describe the issue you're reporting

Hi,

I am running the otel collector-contrib version 0.100.0 with this configuration:

exporters:
  otlp/apm:
    endpoint: server:1234
    headers:
      Authorization: Bearer ${env:SECRET_TOKEN}
    tls:
      ca_file: /certs/ca.pem
      cert_file: /certs/tls.crt
      insecure_skip_verify: true
      key_file: /certs/tls.key
extensions:
  basicauth/server:
    htpasswd:
      file: /basicauth/.htpasswd
  health_check: {}
  zpages: {}
processors:
  attributes/tenantid:
    actions:
    - action: upsert
      key: tenant.id
      value: ${env:TENANT_ID}
  batch: {}
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 19
receivers:
  otlp:
    protocols:
      grpc:
        auth:
          authenticator: basicauth/server
        endpoint: ${MY_POD_IP}:4317
        tls:
          ca_file: /certs/ca.pem
          cert_file: /certs/tls.crt
          key_file: /certs/tls.key
      http:
        auth:
          authenticator: basicauth/server
        endpoint: ${MY_POD_IP}:4318
        tls:
          ca_file: /certs/ca.pem
          cert_file: /certs/tls.crt
          key_file: /certs/tls.key
service:
  extensions:
  - health_check
  - zpages
  - basicauth/server
  pipelines:
    logs:
      exporters:
      - otlp/apm
      processors:
      - memory_limiter
      - attributes/tenantid
      - batch
      receivers:
      - otlp
      - syslog
      - tcplog
    metrics:
      exporters:
      - otlp/apm
      processors:
      - memory_limiter
      - attributes/tenantid
      - batch
      receivers:
      - otlp
    traces:
      exporters:
      - otlp/apm
      processors:
      - memory_limiter
      - attributes/tenantid
      - batch
      receivers:
      - otlp
  telemetry:
    logs:
      encoding: json
      level: debug
    metrics:
      address: 0.0.0.0:8888
      level: detailed

I'm running this collector in a Kubernetes pod (one container only), with 2 CPUs and 4 GB of memory attached.

When I run the telemetrygen command to send 1000 traces, it finishes within 1 second.

If I run basically the same command to send either 1000 logs or 1000 metrics, it takes ~2-3 minutes to complete. Looking at Grafana and the collector dashboard, only ~20 metrics/s or ~8 logs/s are received by the collector.

Additionally, the CPU load on the collector goes very high when sending metrics or logs, to around 60%.

By comparison, when sending traces for 5 minutes with 1 worker, the CPU usage on the collector also goes to ~60%, but it is processing ~4k spans/s.

At no point is any queuing shown within the collector.

Example commands I ran

~/go/bin/telemetrygen traces --otlp-endpoint otlp-endpoint:4317 --otlp-header Authorization=\"Basic\ base64...\" --service=my-custom-service --duration=5m --rate=0

Similar commands were used to send 1000 metrics or logs (--metrics 1000 or --logs 1000, without setting --duration), roughly as sketched below.
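
For completeness, the metrics and logs runs looked roughly like this (a sketch reconstructed from the flags mentioned above, not copied from my shell history):

~/go/bin/telemetrygen metrics --otlp-endpoint otlp-endpoint:4317 --otlp-header Authorization=\"Basic\ base64...\" --service=my-custom-service --metrics 1000 --rate=0

~/go/bin/telemetrygen logs --otlp-endpoint otlp-endpoint:4317 --otlp-header Authorization=\"Basic\ base64...\" --service=my-custom-service --logs 1000 --rate=0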

Here are some screenshots taken while trying to send 1000 logs to the collector:

Receiver: [screenshot]

Processor: [screenshot]

Exporters: [screenshot]

Collector stats: [screenshot]

Any hint as to what is going on here or what I am doing wrong?

Thanks

rpasche added the needs triage label Jun 7, 2024

github-actions bot commented Jun 7, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the cmd/telemetrygen label Jun 7, 2024

cforce commented Jun 8, 2024

@rpasche I am not sure that the "cmd/telemetrygen" component and its code owners can help here. If I understand correctly, you are complaining about high CPU on the collector side for logs and metrics, not in the load generator "cmd/telemetrygen".
From the graphs it looks like the log load is what drives the CPU, so you should perhaps switch on collector profiling and check in which Go modules, packages, and methods the CPU time is mostly spent.
I don't see very specific otel modules involved, so your expectation might be that the otel core should have a smaller CPU footprint.

rpasche commented Jun 11, 2024

Hi @cforce, I activated the pprof extension and noticed that the collector spent most of its time in the bcrypt decrypt function. So I disabled authentication while testing, but still, sending traces worked as before, while sending logs or metrics remained pretty slow.
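
For reference, activating the pprof extension only required adding it to the collector config, roughly like this (a minimal sketch; the endpoint uses the extension's default port on the pod IP, not necessarily my exact value):

extensions:
  pprof:
    endpoint: ${MY_POD_IP}:1777
service:
  extensions:
  - pprof
  - health_check
  - zpages
  - basicauth/server

The CPU profile can then be pulled with something like go tool pprof http://<pod-ip>:1777/debug/pprof/profile.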

I think I now see what is causing this. For metrics and logs, every single metric or log is sent on its own; there is no batching, even if --rate=0 is set. So most likely latency is the issue, and with one worker this currently results in ~20 metrics or logs per second.

rpasche commented Jun 12, 2024

This is a crude adjustment of the code, but it basically works: metrics and logs are now also sent in batches of 1000 signals.

--- a/cmd/telemetrygen/internal/metrics/worker.go
+++ b/cmd/telemetrygen/internal/metrics/worker.go
@@ -46,44 +46,62 @@ func (w worker) simulateMetrics(res *resource.Resource, exporterFunc func() (sdk
 		}
 	}()

-	var i int64
+	var i, batch int64
+
+	const BATCHSIZE = 1000
+
 	for w.running.Load() {
 		var metrics []metricdata.Metrics
-
-		switch w.metricType {
-		case metricTypeGauge:
-			metrics = append(metrics, metricdata.Metrics{
-				Name: w.metricName,
-				Data: metricdata.Gauge[int64]{
-					DataPoints: []metricdata.DataPoint[int64]{
-						{
-							Time:       time.Now(),
-							Value:      i,
-							Attributes: attribute.NewSet(signalAttrs...),
-							Exemplars:  w.exemplars,
+		batch = 0
+		for {
+			switch w.metricType {
+			case metricTypeGauge:
+				metrics = append(metrics, metricdata.Metrics{
+					Name: w.metricName,
+					Data: metricdata.Gauge[int64]{
+						DataPoints: []metricdata.DataPoint[int64]{
+							{
+								Time:       time.Now(),
+								Value:      i,
+								Attributes: attribute.NewSet(signalAttrs...),
+								Exemplars:  w.exemplars,
+							},
 						},
 					},
-				},
-			})
-		case metricTypeSum:
-			metrics = append(metrics, metricdata.Metrics{
-				Name: w.metricName,
-				Data: metricdata.Sum[int64]{
-					IsMonotonic: true,
-					Temporality: metricdata.CumulativeTemporality,
-					DataPoints: []metricdata.DataPoint[int64]{
-						{
-							StartTime:  time.Now().Add(-1 * time.Second),
-							Time:       time.Now(),
-							Value:      i,
-							Attributes: attribute.NewSet(signalAttrs...),
-							Exemplars:  w.exemplars,
+				})
+			case metricTypeSum:
+				metrics = append(metrics, metricdata.Metrics{
+					Name: w.metricName,
+					Data: metricdata.Sum[int64]{
+						IsMonotonic: true,
+						Temporality: metricdata.CumulativeTemporality,
+						DataPoints: []metricdata.DataPoint[int64]{
+							{
+								StartTime:  time.Now().Add(-1 * time.Second),
+								Time:       time.Now(),
+								Value:      i,
+								Attributes: attribute.NewSet(signalAttrs...),
+								Exemplars:  w.exemplars,
+							},
 						},
 					},
-				},
-			})
-		default:
-			w.logger.Fatal("unknown metric type")
+				})
+			default:
+				w.logger.Fatal("unknown metric type")
+			}
+
+			i++
+			batch++
+			if err := limiter.Wait(context.Background()); err != nil {
+				w.logger.Fatal("limiter wait failed, retry", zap.Error(err))
+			}
+			if w.numMetrics != 0 && i >= int64(w.numMetrics) {
+				break
+			}
+			// break if BATCHSIZE is reached
+			if batch >= BATCHSIZE {
+				break
+			}
 		}

 		rm := metricdata.ResourceMetrics{
@@ -94,11 +112,6 @@ func (w worker) simulateMetrics(res *resource.Resource, exporterFunc func() (sdk
 		if err := exporter.Export(context.Background(), &rm); err != nil {
 			w.logger.Fatal("exporter failed", zap.Error(err))
 		}
-		if err := limiter.Wait(context.Background()); err != nil {
-			w.logger.Fatal("limiter wait failed, retry", zap.Error(err))
-		}
-
-		i++
 		if w.numMetrics != 0 && i >= int64(w.numMetrics) {
 			break
 		}

After building, running

./telemetrygen logs --otlp-endpoint otlp-grpc-endpoint:4317 --otlp-header Authorization=\"Basic\ XXXXX\" --duration=5m --rate=0

looks like this

Receiver + Processor: [screenshot]

Exporter: [screenshot]

In those 5 minutes, that one worker sent 5,572,000 logs:

...
2024-06-12T09:02:07.276+0200	INFO	logs/worker.go:107	logs generated	{"worker": 0, "logs": 5572000}


github-actions bot commented Aug 12, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label Aug 12, 2024

cforce commented Aug 12, 2024

@rpasche It would be great if you could polish your code and finally upstream it as a PR.
Thanks for your contribution.
