
Remove log from Persistence storage only if exporter is able to consume the log and respond with 200 status code #36584

Open
amanmoar277 opened this issue Nov 28, 2024 · 5 comments


amanmoar277 commented Nov 28, 2024

Component(s)

exporter/elasticsearch, exporter/kafka, receiver/kafka

Is your feature request related to a problem? Please describe.

I am collecting logs via an HTTP endpoint and sending the data to Elasticsearch (ES) using the elasticsearchexporter.

Everything works fine while the ES instance is up.
But there are two issues:

  1. What happens if ES is down and we are continuously receiving logs over the HTTP receiver?
  2. What happens to the in-memory logs that accumulated while ES was down?

A few points:

  1. I am already using the retry option of the elasticsearchexporter, but it does not work well when a large number of logs is received per second.
  2. It is suggested to use Kafka as a persistence buffer, but the same issue occurs when logs are consumed from Kafka while ES is down, because as of now I have not found any acknowledgement-related control on the Kafka receiver.

The end goal is:

  1. No logs should be dropped.
  2. If ES is down for some time, logs should be properly aggregated and replayed once ES is up (retry is not working well here).
  3. If something goes wrong with the OpenTelemetry Collector instance, no logs should be lost.

Please suggest an approach for this.

Describe the solution you'd like

Remove a log from persistent storage only once the exporter has consumed it and the backend has responded with a 200 status code.
Ensure at-least-once delivery to the exporter.

Describe alternatives you've considered

I have tried the following approaches:

  1. HTTP -> ES
  2. HTTP -> KAFKA -> ES

But the same issue can occur with Kafka as well (a rough sketch of the second topology is below).
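
For clarity, a rough sketch of the HTTP -> KAFKA -> ES topology as two collector configs. The broker address, topic name, and consumer group below are placeholders, not values from a real setup:

# Collector A: receives logs over OTLP/HTTP and buffers them into Kafka.
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
exporters:
  kafka:
    brokers: ["kafka:9092"]     # placeholder broker address
    topic: otlp_logs            # placeholder topic
service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [kafka]

# Collector B: consumes the buffered logs from Kafka and indexes them into ES.
receivers:
  kafka:
    brokers: ["kafka:9092"]
    topic: otlp_logs
    group_id: otel-es-indexer   # placeholder consumer group
exporters:
  elasticsearch:
    endpoint: "http://elasticsearch:9200"
service:
  pipelines:
    logs:
      receivers: [kafka]
      exporters: [elasticsearch]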

Additional context

No response


Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@carsonip
Contributor

In your case, I believe it will make sense to enable a persistence-backed sending_queue, with the elasticsearchexporter config batcher::enabled (see docs) set to true, and the number of retries set to a very high number. This should address "If ES is down for some time, logs should be properly aggregated and replayed once ES is up" and "If something goes wrong with the OpenTelemetry Collector instance, no logs should be lost". But even so, the sending queue has a limit on the number of requests stored, and any new logs will be rejected by the queue once that limit is reached.
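
For reference, a minimal sketch of that setup. The storage directory, queue size, and retry count below are placeholders, not values recommended in this thread:

extensions:
  file_storage:
    directory: /var/lib/otelcol/storage   # placeholder path; must be writable and persistent

exporters:
  elasticsearch:
    endpoint: "http://elasticsearch:9200"
    batcher:
      enabled: true                       # batching handled inside the exporter
    sending_queue:
      enabled: true
      storage: file_storage               # queued requests survive collector restarts
      queue_size: 10000                   # placeholder; size for the expected backlog while ES is down
    retry:
      enabled: true
      max_retries: 10000                  # "very high" so requests keep retrying instead of being dropped

service:
  extensions: [file_storage]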

@carsonip
Contributor

/label -needs-triage

@github-actions bot removed the "needs triage" label Nov 28, 2024
Author

amanmoar277 commented Dec 2, 2024

Hi @carsonip, I tried these options but they are not meeting my expectations.
I can only see a few of the logs being processed on retry. I am using file_storage for persistent storage.
But with this config, things can still go wrong.

Is there any other approach which can guarantee delivery of logs to Elasticsearch at least once?

I am using the following config:

receivers:
  otlp:
    protocols:
      grpc: {}
      http: 
        endpoint: 0.0.0.0:4318
        cors:
          allowed_origins:
            - "http://*"
            - "https://*"

processors:
  batch:
    send_batch_size: 1000000
    timeout: 10s

extensions:
  file_storage:
    directory: /var/lib/storage/otc
    timeout: 10s
    fsync: false
    create_directory: true


exporters:
  debug:
    verbosity: detailed
  elasticsearch:
    endpoint: "http://elasticsearch:9200"
    timeout: 30s

    sending_queue:
      enabled: true
      queue_size: 9000
      num_consumers: 1
      storage: file_storage

    batcher:
      enabled: true
      min_size_items: 10
      max_size_items: 0
      flush_timeout: 10s

    logs_index: "qa-opentelemetry-otel-logs"


    mapping:
      mode: raw
    
    num_workers: 1

    flush:
      bytes: 1000000000
      interval: 10s

    retry:
      enabled: true
      max_retries: 2000
      initial_interval: 500ms
      max_interval: 10m
      retry_on_status: [429, 500, 501, 502, 503, 504]

    discover:
      on_start: true
      interval: 2s

    telemetry:
      log_request_body: true
      log_response_body: true

service:
  extensions: [file_storage]
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, elasticsearch]
      

And I am using this docker-compose file:

version: '3.8'

services:
  opentelemetry-collector:
    image: otel/opentelemetry-collector-contrib:latest
    container_name: otel-collector
    ports:
      - "4317:4317"
      - "4318:4318"
      - "55681:55681"
    volumes:
      - /./otel-collector-config.yml:/etc/otel-collector-config.yaml
      - ./queueLogs:/var/lib/storage/otc
    command: ["--config", "/etc/otel-collector-config.yaml"]
    environment:
      LOG_LEVEL: debug

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.2
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
      - "9300:9300"

  kibana:
    image: docker.elastic.co/kibana/kibana:8.13.2
    container_name: kibana
    depends_on:
      - elasticsearch
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"

Contributor

carsonip commented Dec 4, 2024

I tried these options but they are not meeting my expectations.
I can only see a few of the logs being processed on retry. I am using file_storage for persistent storage.
But with this config, things can still go wrong.

Sorry, do you mind elaborating on this? The config looks good to me.

  • When you say "a few of the logs being processed on retry", do you mean the others are dropped?
  • Do you have any numbers on how many logs are sent to the collector, and how many end up in ES?
  • Do you have a synthetic data set for me to reproduce the issue?
  • Are there any error logs?
  • Do you have visibility into the queue size? You should be able to check the queue size with internal telemetry.
  • I see sending_queue::num_consumers: 1 and num_workers: 1. Under high load, this is likely too slow to index into ES, and will cause backpressure and fill up the queue. Try 10 for both of them (see the sketch after this list).
