
Remove log from Persistence storage only if exporter is able to consume the log and respond with 200 status code #36584

Open
amanmoar277 opened this issue Nov 28, 2024 · 5 comments


amanmoar277 commented Nov 28, 2024

Component(s)

exporter/elasticsearch, exporter/kafka, receiver/kafka

Is your feature request related to a problem? Please describe.

I am collecting logs via an HTTP endpoint and sending the data to Elasticsearch (ES) using the elasticsearchexporter.

Everything works fine while the ES instance is up.
But there are two issues:

  1. What happens if ES is down and we are continuously receiving logs over the HTTP receiver?
  2. What happens to the in-memory logs that accumulated while ES was down?

A few points:

  1. I am already using the retry option of the elasticsearchexporter, but it does not work well when a large number of logs is received per second.
  2. It is suggested to use Kafka as a persistence buffer, but the same issue occurs when logs are consumed from Kafka while ES is down, because as of now I have not found any acknowledgement-related control on the Kafka receiver.

The end goal is:

  1. No logs should be dropped.
  2. If ES is down for some time, logs should be properly aggregated and replayed once ES is up (retry is not working well here).
  3. If something goes wrong with the OpenTelemetry Collector instance, no logs should be lost.

Please suggest an approach for this.

Describe the solution you'd like

Remove a log from persistent storage only once the exporter has consumed it and the backend has responded with a 200 status code.
Ensure at-least-once delivery to the exporter.

Describe alternatives you've considered

I have tried the following approaches:

  1. HTTP -> ES
  2. HTTP -> KAFKA -> ES

But the same issue can occur with Kafka as well (a rough sketch of the second topology is below).
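
For clarity, a rough sketch of the HTTP -> KAFKA -> ES topology as two collector configs. The broker address, topic name, and consumer group below are placeholders, not values from a real setup:

# Collector A: receives logs over OTLP/HTTP and buffers them into Kafka.
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
exporters:
  kafka:
    brokers: ["kafka:9092"]     # placeholder broker address
    topic: otlp_logs            # placeholder topic
service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [kafka]

# Collector B: consumes the buffered logs from Kafka and indexes them into ES.
receivers:
  kafka:
    brokers: ["kafka:9092"]
    topic: otlp_logs
    group_id: otel-es-indexer   # placeholder consumer group
exporters:
  elasticsearch:
    endpoint: "http://elasticsearch:9200"
service:
  pipelines:
    logs:
      receivers: [kafka]
      exporters: [elasticsearch]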

Additional context

No response


Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@carsonip
Contributor

In your case, I believe it will make sense to enable a persistence-backed sending_queue, with the elasticsearchexporter config batcher::enabled (see docs) set to true, and the number of retries set to a very high number. This should address "If ES is down for some time, logs should be properly aggregated and replayed once ES is up" and "If something goes wrong with the OpenTelemetry Collector instance, no logs should be lost". But even so, the sending queue has a limit on the number of requests stored, and any new logs will be rejected by the queue once that limit is reached.
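
For reference, a minimal sketch of that setup. The storage directory, queue size, and retry count below are placeholders, not values recommended in this thread:

extensions:
  file_storage:
    directory: /var/lib/otelcol/storage   # placeholder path; must be writable and persistent

exporters:
  elasticsearch:
    endpoint: "http://elasticsearch:9200"
    batcher:
      enabled: true                       # batching handled inside the exporter
    sending_queue:
      enabled: true
      storage: file_storage               # queued requests survive collector restarts
      queue_size: 10000                   # placeholder; size for the expected backlog while ES is down
    retry:
      enabled: true
      max_retries: 10000                  # "very high" so requests keep retrying instead of being dropped

service:
  extensions: [file_storage]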

@carsonip
Contributor

/label -needs-triage

@github-actions bot removed the "needs triage" label Nov 28, 2024
Author

amanmoar277 commented Dec 2, 2024

Hi @carsonip, I tried these options but they are not meeting my expectations.
I can only see a few of the logs being processed on retry. I am using file_storage for persistent storage.
But with this config, things can still go wrong.

Is there any other approach which can guarantee delivery of logs to Elasticsearch at least once?

I am using the following config:

receivers:
  otlp:
    protocols:
      grpc: {}
      http: 
        endpoint: 0.0.0.0:4318
        cors:
          allowed_origins:
            - "http://*"
            - "https://*"

processors:
  batch:
    send_batch_size: 1000000
    timeout: 10s

extensions:
  file_storage:
    directory: /var/lib/storage/otc
    timeout: 10s
    fsync: false
    create_directory: true


exporters:
  debug:
    verbosity: detailed
  elasticsearch:
    endpoint: "http://elasticsearch:9200"
    timeout: 30s

    sending_queue:
      enabled: true
      queue_size: 9000
      num_consumers: 1
      storage: file_storage

    batcher:
      enabled: true
      min_size_items: 10
      max_size_items: 0
      flush_timeout: 10s

    logs_index: "qa-opentelemetry-otel-logs"


    mapping:
      mode: raw
    
    num_workers: 1

    flush:
      bytes: 1000000000
      interval: 10s

    retry:
      enabled: true
      max_retries: 2000
      initial_interval: 500ms
      max_interval: 10m
      retry_on_status: [429, 500, 501, 502, 503, 504]

    discover:
      on_start: true
      interval: 2s

    telemetry:
      log_request_body: true
      log_response_body: true

service:
  extensions: [file_storage]
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, elasticsearch]
      

And I am using this docker-compose file:

version: '3.8'

services:
  opentelemetry-collector:
    image: otel/opentelemetry-collector-contrib:latest
    container_name: otel-collector
    ports:
      - "4317:4317"
      - "4318:4318"
      - "55681:55681"
    volumes:
      - /./otel-collector-config.yml:/etc/otel-collector-config.yaml
      - ./queueLogs:/var/lib/storage/otc
    command: ["--config", "/etc/otel-collector-config.yaml"]
    environment:
      LOG_LEVEL: debug

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.2
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
      - "9300:9300"

  kibana:
    image: docker.elastic.co/kibana/kibana:8.13.2
    container_name: kibana
    depends_on:
      - elasticsearch
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"

Contributor

carsonip commented Dec 4, 2024

I tried these options but they are not meeting my expectations.
I can only see a few of the logs being processed on retry. I am using file_storage for persistent storage.
But with this config, things can still go wrong.

Sorry, do you mind elaborating on this? The config looks good to me.

  • When you say "a few of the logs being processed on retry", do you mean the others are dropped?
  • Do you have any numbers on how many logs are sent to the collector, and how many end up in ES?
  • Do you have a synthetic data set for me to reproduce the issue?
  • Are there any error logs?
  • Do you have visibility into the queue size? You should be able to check the queue size with internal telemetry.
  • I see sending_queue::num_consumers: 1 and num_workers: 1. Under high load, this is likely too slow to index into ES, and will cause backpressure and fill up the queue. Try 10 for both of them (see the sketch after this list).
