Logs not being flushed after x amount of time #2969

Closed
Xyrion opened this issue Apr 23, 2020 · 9 comments


Xyrion commented Apr 23, 2020


Describe the bug
I have an environment running Fluent Bit > Fluentd > Elasticsearch.
For a while my logs are flushed as they should be, but then the logs stop being flushed and the buffer grows until the port gets blocked because of overflow_action block.

What I have seen:
fluentd_output_status_num_errors is high for certain matches; most matches have at least some errors.
Q: How do I view these errors? I don't see them in the logs even though I have tried with debug logging.

fluentd_output_status_buffer_queue_length is very high for two specific matches. These two matches have retry_max_interval 30, yet after a few hours their queues are still growing. They are also the only matches where I do an "include" only.

Q: How do I stop this from happening? Is there a way to see what is holding up the queue? I have checked the buffer path, and the first log that came in looks like a normal log of a kind that has been processed before.
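
For reference, one way to see what is holding up a queue is fluentd's bundled in_monitor_agent input, which exposes per-plugin buffer and retry state over HTTP. A minimal sketch, assuming the plugin's default port 24220 is free:

    # Sketch only: expose per-plugin state over HTTP.
    # GET http://localhost:24220/api/plugins.json then reports buffer_queue_length,
    # buffer_total_queued_size and retry_count for each output.
    <source>
      @type monitor_agent
      bind 0.0.0.0
      port 24220
    </source>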

To Reproduce

Expected behavior
The logs should flow through continuously; if there are any errors, they should be printed in the logs.

Your Environment
Running in a container with:
repository: gcr.io/google-containers/fluentd-elasticsearch
tag: v2.4.0

Your Configuration

Here are my match blocks:
    <match auth.keycloak.app>
      @type rewrite_tag_filter
      <rule>
        key log
        pattern /org.keycloak.events/
        tag keycloak.auth
      </rule>
    </match>

    <match keycloak.auth>
      @type rewrite_tag_filter
      <rule>
        key log
        pattern /type=(?<type>[^ ]+)(?<!,)/
        tag authorization.keycloak.$1
      </rule>
    </match>

    <match authorization.keycloak.**>
      @id elasticsearch-keycloak-authorization
      @type elasticsearch
      @log_level error
      log_es_400_reason true
      include_tag_key true
      host "#{ENV['OUTPUT_HOST']}"
      port "#{ENV['OUTPUT_PORT']}"
      scheme "#{ENV['OUTPUT_SCHEME']}"
      ssl_version "#{ENV['OUTPUT_SSL_VERSION']}"
      user client
      password "#{ENV['OUTPUT_PASSWORD']}"
      ssl_verify false
      logstash_format true
      logstash_prefix auth-keycloak
      <buffer>
        @type file
        flush_mode immediate
        path /var/log/fluentd-buffers/authorization-keycloak.buffer
        retry_type exponential_backoff
        flush_thread_count 2
        retry_limit 20
        retry_max_interval 30
        chunk_limit_size "#{ENV['OUTPUT_BUFFER_CHUNK_LIMIT']}"
        queue_limit_length "#{ENV['OUTPUT_BUFFER_QUEUE_LIMIT']}"
        overflow_action block
      </buffer>
    </match>
Any help is much appreciated!
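
For reference, a <secondary> section inside the elasticsearch <match> can move chunks that exhaust their retries out of the queue and onto disk, where the raw data can be inspected. A minimal sketch, with an illustrative directory path that is not taken from the config above:

    # Sketch only: dump chunks that exhaust retry_limit to local files for inspection.
    # Place this inside the elasticsearch <match>; the directory is illustrative.
    <secondary>
      @type secondary_file
      directory /var/log/fluentd-failed-chunks
      basename authorization-keycloak
    </secondary>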

Your Error Log
There is no error log! I see the problem through Prometheus: the buffer just keeps growing and does not drop until the service has been redeployed or restarted.


Additional context


Xyrion commented Apr 23, 2020

Please tell me if you want the full config, but I think I posted all the relevant parts.


Xyrion commented Apr 23, 2020

Just found this:
2020-04-23 10:25:51 +0000 [warn]: #0 [elasticsearch-syslog] failed to flush the buffer. retry_time=9 next_retry_seconds=2020-04-23 10:26:23 +0000 chunk="5a3f2a6dbbe3f4401db5fd104f585ab1" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>"od-opendistro-es-client-service.logging.svc.cluster.local", :port=>9200, :scheme=>"https", :user=>"client", :password=>"obfuscated"}): read timeout reached"
2020-04-23 10:25:51 +0000 [warn]: #0 suppressed same stacktrace

How do I check the troublesome chunk?


Xyrion commented Apr 23, 2020

I am not seeing any fluentd_output_status_num_errors since I changed overflow_action block to overflow_action throw_exception.

Anyone understand why this is the case?

repeatedly (Member) commented

> How do I check the troublesome chunk?

The log includes the chunk id: chunk="5a3f2a6dbbe3f4401db5fd104f585ab1"
But the problem seems to be your Elasticsearch cluster or network, not the chunk itself ("read timeout reached").
Are there no errors in your Elasticsearch cluster? Read timeouts often happen when the cluster doesn't have enough capacity.
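
For reference, "read timeout reached" is raised on the fluentd side when Elasticsearch does not answer within the output plugin's request timeout, so if the cluster is slow rather than down the timeout can be raised. A minimal sketch of options to merge into the existing elasticsearch <match> blocks; 30s is an illustrative value, not a recommendation from this thread:

      # Sketch only: merge into the existing elasticsearch <match> blocks.
      request_timeout 30s      # per-request read timeout; the plugin default is 5s
      reconnect_on_error true  # rebuild the connection after a failed request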


Xyrion commented Apr 27, 2020

Thanks for your reply @repeatedly, I will have a look at the read timeout you mentioned. Another question: are all the errors I see in Prometheus related to this, or is there a more direct way for me to check what fluentd_output_status_num_errors is actually counting?


Xyrion commented May 14, 2020

Now I'm getting this:
2020-05-14 21:19:33 +0000 [error]: #0 [elasticsearch-kube] [Faraday::TimeoutError] read timeout reached {:host=>"xxxxxxxxxxxx, :port=>9200, :scheme=>"https", :user=>"xxxxxxx", :password=>"xxxxxxxxxxxx", :protocol=>"https"}

Config:

    <match **.namespace>
      @id elasticsearch-kube
      @type elasticsearch
      @log_level error
      emit_error_for_missing_id true
      with_transporter_log true
      reconnect_on_error true
      log_es_400_reason true
      include_tag_key true
      host "#{ENV['OUTPUT_HOST']}"
      port "#{ENV['OUTPUT_PORT']}"
      scheme "#{ENV['OUTPUT_SCHEME']}"
      ssl_version "#{ENV['OUTPUT_SSL_VERSION']}"
      user fluentd
      password "#{ENV['OUTPUT_PASSWORD']}"
      ssl_verify false
      logstash_format true
      logstash_prefix kube-namespaces
      <buffer>
        @type file
        flush_mode interval
        path /var/log/fluentd-buffers/kube-namespaces.buffer
        retry_type exponential_backoff
        flush_thread_count 2
        flush_interval 3s
        retry_limit 20
        retry_max_interval 30
        chunk_limit_size "#{ENV['OUTPUT_BUFFER_CHUNK_LIMIT']}"
        queue_limit_length "#{ENV['OUTPUT_BUFFER_QUEUE_LIMIT']}"
        overflow_action throw_exception
      </buffer>
    </match>

If I try to curl ES from another pod, it works fine. It feels like there is some underlying cause which isn't getting printed out. Any ideas what I could do here, @repeatedly?

I have tried:

Even though some of the suggested configs are default, I set them anyway. I'm going crazy trying to debug something that doesn't print enough usable logs. Any help is much appreciated!
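
For reference, both configs above set @log_level error on the elasticsearch outputs, which suppresses the warn-level "failed to flush the buffer" retry messages from those plugins. A minimal sketch of settings that tend to surface more detail when flushes hang; the threshold value is illustrative:

      # Sketch only: merge into the existing elasticsearch <match> sections.
      @log_level debug                 # error hides warn-level flush/retry messages
      slow_flush_log_threshold 15.0    # warn when one flush takes longer than this many seconds (default 20.0)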


sunyl527 commented Jun 1, 2020

I met the same issue. It seems that once the buffer file size grows beyond chunk_limit_size * queue_limit_length with overflow_action block, fluentd blocks. If I change block to drop_oldest_chunk, logs are flushed again, but I think that will lose some logs. Is there a better way to solve this issue?
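
For reference, with the v1 buffer configuration the total buffer budget can also be capped directly with total_limit_size instead of indirectly through queue_limit_length; overflow_action only decides what happens once that cap is hit. A minimal sketch with illustrative sizes and path:

      # Sketch only: sizes and path are illustrative.
      # block back-pressures the input; drop_oldest_chunk keeps flushing but discards the oldest data.
      <buffer>
        @type file
        path /var/log/fluentd-buffers/example.buffer
        chunk_limit_size 8MB
        total_limit_size 512MB
        overflow_action drop_oldest_chunk
      </buffer>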


This issue has been automatically marked as stale because it has been open for 90 days with no activity. Remove the stale label or comment, or this issue will be closed in 30 days.

github-actions bot added the stale label Dec 18, 2020

This issue was automatically closed because it remained stale for 30 days.
