Collector randomly stops sending spans #31758

Closed
MarcinGinszt opened this issue Mar 14, 2024 · 8 comments


Component(s)

No response

What happened?

Description

The OpenTelemetry Collector randomly stops sending spans. We have hit this twice this week. It affects only one of the collector pods; the rest work correctly. We are notified by an alert that the sending queue is full. After inspecting the pod metrics, it turns out that the queue fills up because otelcol_exporter_sent_spans drops to 0.
image

Nothing appears in the logs before the error about the sending queue being full.

Are there additional ways to diagnose the issue before resorting to pprof?
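
One check that needs no profiler is to compare two scrapes of the collector's own metrics and see whether the sent-span counter has stopped moving while the queue is non-empty. The following is a minimal sketch, not something taken from the collector project. It assumes the metrics are available in Prometheus text format, and it assumes a queue gauge named otelcol_exporter_queue_size alongside the otelcol_exporter_sent_spans counter mentioned above; newer collector versions may add suffixes such as _total, so the names are parameters. The function names are hypothetical.

```python
def parse_samples(text):
    """Sum Prometheus text-format samples by metric name, ignoring labels."""
    totals = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # name{labels} value [timestamp]
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1 :]
        else:
            # name value [timestamp]
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if fields:
            totals[name] = totals.get(name, 0.0) + float(fields[0])
    return totals


def exporter_stalled(
    before,
    after,
    sent_metric="otelcol_exporter_sent_spans",  # from the issue text
    queue_metric="otelcol_exporter_queue_size",  # assumed metric name
):
    """True if no spans were sent between two scrapes while the queue holds data."""
    sent_delta = after.get(sent_metric, 0.0) - before.get(sent_metric, 0.0)
    return sent_delta <= 0 and after.get(queue_metric, 0.0) > 0
```

Feeding it two scrapes taken a minute apart from each pod would flag a stalled pod before the queue is completely full, which is earlier than an alert on the queue-full error.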

Steps to Reproduce

Expected Result

Actual Result

Collector version

0.95.0

Environment information

Environment

https://github.com/utilitywarehouse/opentelemetry-manifests

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
        keepalive:
          server_parameters:
            max_connection_age: 5m
            max_connection_age_grace: 1m
            max_connection_idle: 10m
        # Accept up to 4MB message
        max_recv_msg_size_mib: 4
      http:

processors:
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 20
  batch:
    # Kafka is limited to a 128MB payload, so we keep in mind
    # that as we can receive up to 4MB messages, we need to
    # keep the batching size low enough to not exceed Kafka's.
    timeout: 10ms
    send_batch_size: 30
    send_batch_max_size: 30
  resource:
    attributes:
      - key: deployment.environment
        value: prod
        action: insert
(... some other attributes)

extensions:
  health_check: {}
  zpages: {}

exporters:
  kafka:
    protocol_version: 2.6.0
    client_id: "otel-collector"
    timeout: 2s
    partition_traces_by_id: true
    brokers:
      - kafka1.svc.cluster:9092
      - kafka2.svc.cluster:9092
      - kafka3.svc.cluster:9092
    topic: "otel.otlp_spans"
    auth:
      tls:
        ca_file: /kafka-client-certificate/ca.crt
        cert_file: /kafka-client-certificate/tls.crt
        key_file: /kafka-client-certificate/tls.key
        reload_interval: 1h
    retry_on_failure:
      initial_interval: 2s
      max_interval: 10s
      max_elapsed_time: 60s
    sending_queue:
      num_consumers: 20
      queue_size: 12000 # 200 req/s * 60s
    producer:
      max_message_bytes: 125829120 # 120MB
      compression: zstd

service:
  extensions: [health_check, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [kafka]
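
The sizing comments in the configuration above can be checked with a few lines of arithmetic. This sketch reads the comments literally: it treats every item in a batch as a full 4 MiB message, which is a worst-case assumption, since send_batch_size counts spans rather than received messages. The helper names are hypothetical.

```python
MIB = 1024 * 1024


def worst_case_batch_bytes(send_batch_max_size, max_recv_msg_size_mib):
    """Upper bound on one batch if every item were a maximum-size message."""
    return send_batch_max_size * max_recv_msg_size_mib * MIB


def queue_seconds(queue_size, requests_per_second):
    """How long the sending queue can absorb traffic while nothing is exported."""
    return queue_size / requests_per_second


# Values taken from the configuration in this issue.
batch_bytes = worst_case_batch_bytes(send_batch_max_size=30, max_recv_msg_size_mib=4)
print(batch_bytes)                 # 125829120, equal to producer.max_message_bytes
print(batch_bytes < 128 * MIB)     # True: below the 128MB Kafka limit from the comment
print(queue_seconds(12000, 200))   # 60.0 seconds of buffering at 200 req/s
```

With these numbers, a stalled exporter fills the queue in about a minute, which matches the queue-full alert being the first visible symptom.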

Log output

No response

Additional context

No response

@MarcinGinszt added the bug and needs triage labels Mar 14, 2024
Contributor

Pinging code owners for exporter/kafka: @pavolloffay @MovieStoreGuy. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@MarcinGinszt
Author

MarcinGinszt commented Mar 19, 2024

We continue to hit this issue once or twice a day. It happens only in our environment with the most traffic.
It is not a matter of resource consumption: resource usage is around 50% of the Kubernetes limits.
We have three collector pods running, and it affects any number of them (1, 2 or 3) simultaneously (e.g. two pods stop producing at the exact same moment while the third works as usual).
Restarting the pods fixes the situation.

There is nothing in the logs, even at debug level.

We analyzed the pprof profiles and goroutine graphs for a working and a broken collector: the broken collector does not run the sarama producer.

I'm attaching the profile graphs here for the OK and broken collectors, for comparison.

Ok profile:
ok
Broken profile:
broken

Ok goroutine:
ok-goroutine
Broken goroutine:
broken-goroutine
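
The same comparison can be done on text goroutine dumps instead of graphs. The sketch below assumes the dumps were saved in the plain-text format where each goroutine starts with a line like `goroutine 12 [select]:`, and it simply reports which substrings (for example `sarama`) appear in the working dump but in no goroutine of the broken one. The function names are hypothetical and the matching is a plain substring test, so it is only a rough filter.

```python
import re

# Each goroutine block in a text dump starts with "goroutine <id> [<state>...]:".
HEADER = re.compile(r"^goroutine \d+ \[", re.MULTILINE)


def split_goroutines(dump):
    """Split a text goroutine dump into one string per goroutine."""
    starts = [m.start() for m in HEADER.finditer(dump)]
    ends = starts[1:] + [len(dump)]
    return [dump[s:e] for s, e in zip(starts, ends)]


def count_matching(dump, needle):
    """Count goroutines whose stack mentions the given substring."""
    return sum(1 for block in split_goroutines(dump) if needle in block)


def missing_in_broken(ok_dump, broken_dump, needles):
    """Substrings present in the working dump but absent from the broken one."""
    return [
        n
        for n in needles
        if count_matching(ok_dump, n) > 0 and count_matching(broken_dump, n) == 0
    ]
```

Calling `missing_in_broken(ok_text, broken_text, ["sarama"])` on the two dumps would return `["sarama"]` if the producer goroutines are gone from the broken collector, which is what the graphs above suggest.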

@MarcinGinszt
Author

MarcinGinszt commented Mar 27, 2024

EDIT: this doesn't happen for every broken pod; most of them don't record any error.

We found an error using the debug/tracez endpoint:
image

It looks like the publishing process silently terminates because of a "context deadline exceeded" error in opentelemetry.proto.collector.trace.v1.traceservice/export.

@atoulme removed the needs triage label Mar 30, 2024
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions bot added the Stale label May 30, 2024
@crobert-1 removed the Stale label May 30, 2024
@rkargMsft

rkargMsft commented Jul 1, 2024

We're encountering this same issue using:

otel/opentelemetry-collector-k8s:0.102.1

No errors logged other than the send queue being full.

Contributor

github-actions bot commented Sep 2, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions bot added the Stale label Sep 2, 2024
Contributor

github-actions bot commented Nov 1, 2024

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Nov 1, 2024