Collector randomly stops sending spans #31758

MarcinGinszt · 2024-03-14T09:47:21Z

Component(s)

No response

What happened?

Description

Otel-collector randomly stops sending spans. We encountered this situation twice this week. It happens to just one of the collector pods; the rest work correctly. We are notified by an alert about the sending queue being full. After inspecting the pod metrics, it turns out that the queue fills up because otelcol_exporter_sent_spans drops to 0.

There is nothing in the logs before the error about sending queue being full.

Are there some additional ways to diagnose the issue before resorting to pprof?
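One lightweight check that needs no profiling is to compare two consecutive scrapes of the collector's own Prometheus metrics: if the sent-spans counter has not moved while the exporter queue is non-empty, the exporter has stalled. The following is a minimal sketch of that comparison, not a collector feature. It assumes the metric names otelcol_exporter_sent_spans (named above) and otelcol_exporter_queue_size as exposed by this collector version (names may differ in other versions); the helper names sumMetric and exporterStalled are hypothetical, and the sample values in main are made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sumMetric adds up every sample of the named metric in a Prometheus
// text-format scrape, ignoring labels. Comment lines are skipped.
func sumMetric(scrape, name string) float64 {
	var total float64
	sc := bufio.NewScanner(strings.NewReader(scrape))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The metric name ends at the first '{' (labels) or ' ' (no labels).
		end := strings.IndexAny(line, "{ ")
		if end < 0 || line[:end] != name {
			continue
		}
		rest := line[end:]
		if strings.HasPrefix(rest, "{") {
			// Skip the label set; the sample value follows the closing brace.
			rest = rest[strings.LastIndex(rest, "}")+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		if v, err := strconv.ParseFloat(fields[0], 64); err == nil {
			total += v
		}
	}
	return total
}

// exporterStalled reports whether no spans were sent between two scrapes
// while items are still waiting in the sending queue.
// Metric names are assumptions; adjust them to what your collector exposes.
func exporterStalled(prev, curr string) bool {
	sentDelta := sumMetric(curr, "otelcol_exporter_sent_spans") -
		sumMetric(prev, "otelcol_exporter_sent_spans")
	queued := sumMetric(curr, "otelcol_exporter_queue_size")
	return sentDelta <= 0 && queued > 0
}

func main() {
	// Made-up sample scrapes taken some seconds apart.
	prev := "otelcol_exporter_sent_spans{exporter=\"kafka\"} 1000\n" +
		"otelcol_exporter_queue_size{exporter=\"kafka\"} 0\n"
	curr := "otelcol_exporter_sent_spans{exporter=\"kafka\"} 1000\n" +
		"otelcol_exporter_queue_size{exporter=\"kafka\"} 12000\n"
	fmt.Println("stalled:", exporterStalled(prev, curr))
}
```

Run per pod, a check like this would flag the affected pod as soon as the counter flattens, rather than only once the queue is already full.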

Steps to Reproduce

Expected Result

Actual Result

Collector version

0.95.0

Environment information

Environment

https://github.com/utilitywarehouse/opentelemetry-manifests

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
        keepalive:
          server_parameters:
            max_connection_age: 5m
            max_connection_age_grace: 1m
            max_connection_idle: 10m
        # Accept up to 4MB message
        max_recv_msg_size_mib: 4
      http:

processors:
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 20
  batch:
    # Kafka is limited to a 128MB payload, so we keep in mind
    # that as we can receive up to 4MB messages, we need to
    # keep the batching size low enough to not exceed Kafka's.
    timeout: 10ms
    send_batch_size: 30
    send_batch_max_size: 30
  resource:
    attributes:
      - key: deployment.environment
        value: prod
        action: insert
(... some other attributes)

extensions:
  health_check: {}
  zpages: {}

exporters:
  kafka:
    protocol_version: 2.6.0
    client_id: "otel-collector"
    timeout: 2s
    partition_traces_by_id: true
    brokers:
      - kafka1.svc.cluster:9092
      - kafka2.svc.cluster:9092
      - kafka3.svc.cluster:9092
    topic: "otel.otlp_spans"
    auth:
      tls:
        ca_file: /kafka-client-certificate/ca.crt
        cert_file: /kafka-client-certificate/tls.crt
        key_file: /kafka-client-certificate/tls.key
        reload_interval: 1h
    retry_on_failure:
      initial_interval: 2s
      max_interval: 10s
      max_elapsed_time: 60s
    sending_queue:
      num_consumers: 20
      queue_size: 12000 # 200 req/s * 60s
    producer:
      max_message_bytes: 125829120 # 120MB
      compression: zstd

service:
  extensions: [health_check, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [kafka]
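
The sizing comments in the configuration above can be checked with a few lines of arithmetic. This is only a sketch of the reasoning stated in those comments, under the config's own assumption that every batched item could be as large as one maximum-size received message (4 MiB); the function names are hypothetical. Note that the batch processor counts items such as spans rather than received messages, so this bound is likely very conservative.

```go
package main

import "fmt"

const mib = 1024 * 1024

// worstCaseBatchBytes returns an upper bound for one exported batch,
// assuming each of batchMaxSize items may be as large as maxRecvMiB.
func worstCaseBatchBytes(maxRecvMiB, batchMaxSize int) int {
	return maxRecvMiB * mib * batchMaxSize
}

// queueSizeFor returns how many requests the sending queue must hold
// to buffer the given number of seconds at reqPerSec requests per second.
func queueSizeFor(reqPerSec, seconds int) int {
	return reqPerSec * seconds
}

func main() {
	// max_recv_msg_size_mib: 4, send_batch_max_size: 30
	fmt.Println(worstCaseBatchBytes(4, 30)) // 125829120, matches max_message_bytes
	// 200 req/s buffered for 60s (the max_elapsed_time of retry_on_failure)
	fmt.Println(queueSizeFor(200, 60)) // 12000, matches queue_size
}
```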

Log output

No response

Additional context

No response

github-actions · 2024-03-14T15:34:43Z

Pinging code owners for exporter/kafka: @pavolloffay @MovieStoreGuy. See Adding Labels via Comments if you do not have permissions to add labels yourself.

MarcinGinszt · 2024-03-19T13:35:09Z

We continue to encounter this issue once or twice daily. It happens only in our environment with the highest traffic.
It's not a matter of resource consumption: resource usage is around 50% of the Kubernetes limits.
We have three collector pods running, and it can affect any number of them (1, 2 or 3) simultaneously (e.g. two pods stop producing at the exact same moment while the third one works as usual).
Restarting the pods fixes the situation.

There is nothing in the logs (debug level).

We analyzed the pprof profiles and goroutine graphs for a working and a broken collector: the broken collector doesn't run the sarama producer process.

I'm attaching the profile graphs here (for the OK and broken collectors, for comparison):

Ok profile:

Broken profile:

Ok goroutine:

Broken goroutine:
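
The same comparison can be made on text goroutine dumps instead of graphs. The sketch below counts goroutines whose stack mentions a given substring, assuming a dump in the text format that Go's net/http/pprof serves for goroutine?debug=2 (stanzas separated by blank lines). The helper name countGoroutinesMatching is hypothetical, the needle "sarama." is an assumption about how the producer's frames appear in the stack, and the dump in main is a made-up illustration, not output from an affected pod.

```go
package main

import (
	"fmt"
	"strings"
)

// countGoroutinesMatching splits a text goroutine dump into per-goroutine
// stanzas and counts those whose stack trace contains needle.
func countGoroutinesMatching(dump, needle string) int {
	count := 0
	for _, stanza := range strings.Split(dump, "\n\n") {
		stanza = strings.TrimSpace(stanza)
		if strings.HasPrefix(stanza, "goroutine ") && strings.Contains(stanza, needle) {
			count++
		}
	}
	return count
}

func main() {
	// Made-up dump for illustration only.
	dump := "goroutine 1 [running]:\nmain.main()\n\t/app/main.go:10 +0x1d\n\n" +
		"goroutine 42 [select]:\ngithub.com/IBM/sarama.(*asyncProducer).dispatcher(0xc000100000)\n\t/go/pkg/mod/sarama/async_producer.go:1 +0x2a\n"
	// A healthy collector is expected to show producer goroutines; per the
	// observation above, a broken one would show none.
	fmt.Println("sarama goroutines:", countGoroutinesMatching(dump, "sarama."))
}
```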

MarcinGinszt · 2024-03-27T11:27:26Z

EDIT: this doesn't happen for every broken pod; most of them don't record any error.

We found an error via the debug/tracez endpoint:

It looks like the publishing process silently terminates because of a context deadline exceeded error in
opentelemetry.proto.collector.trace.v1.traceservice/export

github-actions · 2024-05-30T03:30:14Z

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

exporter/kafka: @pavolloffay @MovieStoreGuy

See Adding Labels via Comments if you do not have permissions to add labels yourself.

rkargMsft · 2024-07-01T19:18:08Z

We're encountering this same issue using:

otel/opentelemetry-collector-k8s:0.102.1

No errors logged other than the send queue being full.

rkargMsft · 2024-07-01T20:52:50Z

This may be open-telemetry/opentelemetry-collector#10315

github-actions · 2024-09-02T03:32:12Z

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

exporter/kafka: @pavolloffay @MovieStoreGuy

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions · 2024-11-01T05:20:09Z

This issue has been closed as inactive because it has been stale for 120 days with no activity.

MarcinGinszt added the bug and needs triage labels Mar 14, 2024

crobert-1 added the exporter/kafka label Mar 14, 2024

atoulme removed the needs triage label Mar 30, 2024

github-actions bot added the Stale label May 30, 2024

crobert-1 removed the Stale label May 30, 2024

github-actions bot added the Stale label Sep 2, 2024

github-actions bot added the closed as inactive label Nov 1, 2024

github-actions bot closed this as not planned Nov 1, 2024
