Logs not being flushed after x amount of time #2969

Closed
Xyrion opened this issue Apr 23, 2020 · 9 comments


Xyrion commented Apr 23, 2020


Describe the bug
I have an environment running Fluent Bit > Fluentd > Elasticsearch.
For a while my logs are flushed as they should be, but then the logs stop being flushed and the buffer grows until the port gets blocked because of overflow_action block.

What I have seen:
fluentd_output_status_num_errors is high for certain matches; most matches have at least some errors.
Q: How do I view these errors? I don't see them in the logs even though I have tried with debug logging.

fluentd_output_status_buffer_queue_length is very high for two specific matches. These two matches have retry_max_interval 30, yet after a few hours their queues are still growing. They are also the only matches where I do an "include" only.

Q: How do I stop this from happening? Is there a way to see what is holding up the queue? I have checked the buffer path, and the first log that came in looks like a normal log of a kind that has been processed before.
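
For reference, one way to see what is holding up a queue is fluentd's bundled in_monitor_agent input, which exposes per-plugin buffer and retry state over HTTP. A minimal sketch, assuming the plugin's default port 24220 is free:

    # Sketch only: expose per-plugin state over HTTP.
    # GET http://localhost:24220/api/plugins.json then reports buffer_queue_length,
    # buffer_total_queued_size and retry_count for each output.
    <source>
      @type monitor_agent
      bind 0.0.0.0
      port 24220
    </source>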

To Reproduce

Expected behavior
The logs should flow through continuously; if there are any errors, they should be printed in the logs.

Your Environment
Running in a container with:
repository: gcr.io/google-containers/fluentd-elasticsearch
tag: v2.4.0

Your Configuration

Here are my match blocks:
    <match auth.keycloak.app>
      @type rewrite_tag_filter
      <rule>
        key log
        pattern /org.keycloak.events/
        tag keycloak.auth
      </rule>
    </match>

    <match keycloak.auth>
      @type rewrite_tag_filter
      <rule>
        key log
        pattern /type=(?<type>[^ ]+)(?<!,)/
        tag authorization.keycloak.$1
      </rule>
    </match>

    <match authorization.keycloak.**>
      @id elasticsearch-keycloak-authorization
      @type elasticsearch
      @log_level error
      log_es_400_reason true
      include_tag_key true
      host "#{ENV['OUTPUT_HOST']}"
      port "#{ENV['OUTPUT_PORT']}"
      scheme "#{ENV['OUTPUT_SCHEME']}"
      ssl_version "#{ENV['OUTPUT_SSL_VERSION']}"
      user client
      password "#{ENV['OUTPUT_PASSWORD']}"
      ssl_verify false
      logstash_format true
      logstash_prefix auth-keycloak
      <buffer>
        @type file
        flush_mode immediate
        path /var/log/fluentd-buffers/authorization-keycloak.buffer
        retry_type exponential_backoff
        flush_thread_count 2
        retry_limit 20
        retry_max_interval 30
        chunk_limit_size "#{ENV['OUTPUT_BUFFER_CHUNK_LIMIT']}"
        queue_limit_length "#{ENV['OUTPUT_BUFFER_QUEUE_LIMIT']}"
        overflow_action block
      </buffer>
    </match>
Any help is much appreciated!
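
For reference, a <secondary> section inside the elasticsearch <match> can move chunks that exhaust their retries out of the queue and onto disk, where the raw data can be inspected. A minimal sketch, with an illustrative directory path that is not taken from the config above:

    # Sketch only: dump chunks that exhaust retry_limit to local files for inspection.
    # Place this inside the elasticsearch <match>; the directory is illustrative.
    <secondary>
      @type secondary_file
      directory /var/log/fluentd-failed-chunks
      basename authorization-keycloak
    </secondary>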

Your Error Log
There is no error log! I see the problem through Prometheus: the buffer just keeps growing and does not drop until the service has been redeployed or restarted.


Additional context


Xyrion commented Apr 23, 2020

Please tell me if you want the full config, but I think I posted all the relevant parts.


Xyrion commented Apr 23, 2020

Just found this:
2020-04-23 10:25:51 +0000 [warn]: #0 [elasticsearch-syslog] failed to flush the buffer. retry_time=9 next_retry_seconds=2020-04-23 10:26:23 +0000 chunk="5a3f2a6dbbe3f4401db5fd104f585ab1" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>"od-opendistro-es-client-service.logging.svc.cluster.local", :port=>9200, :scheme=>"https", :user=>"client", :password=>"obfuscated"}): read timeout reached"
2020-04-23 10:25:51 +0000 [warn]: #0 suppressed same stacktrace

How do I check the troublesome chunk?


Xyrion commented Apr 23, 2020

I am not seeing any fluentd_output_status_num_errors since I changed overflow_action block to overflow_action throw_exception.

Anyone understand why this is the case?

repeatedly (Member) commented

> How do I check the troublesome chunk?

The log includes the chunk id: chunk="5a3f2a6dbbe3f4401db5fd104f585ab1"
But the problem seems to be your Elasticsearch cluster or network, not the chunk itself ("read timeout reached").
Are there no errors in your Elasticsearch cluster? Read timeouts often happen when the cluster doesn't have enough capacity.
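
For reference, "read timeout reached" is raised on the fluentd side when Elasticsearch does not answer within the output plugin's request timeout, so if the cluster is slow rather than down the timeout can be raised. A minimal sketch of options to merge into the existing elasticsearch <match> blocks; 30s is an illustrative value, not a recommendation from this thread:

      # Sketch only: merge into the existing elasticsearch <match> blocks.
      request_timeout 30s      # per-request read timeout; the plugin default is 5s
      reconnect_on_error true  # rebuild the connection after a failed request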


Xyrion commented Apr 27, 2020

Thanks for your reply @repeatedly, I will have a look at the read timeout you mentioned. Another question: are all the errors I see in Prometheus related to this, or is there a more direct way for me to check what fluentd_output_status_num_errors is actually counting?


Xyrion commented May 14, 2020

Now I'm getting this:
2020-05-14 21:19:33 +0000 [error]: #0 [elasticsearch-kube] [Faraday::TimeoutError] read timeout reached {:host=>"xxxxxxxxxxxx, :port=>9200, :scheme=>"https", :user=>"xxxxxxx", :password=>"xxxxxxxxxxxx", :protocol=>"https"}

Config:

    <match **.namespace>
      @id elasticsearch-kube
      @type elasticsearch
      @log_level error
      emit_error_for_missing_id true
      with_transporter_log true
      reconnect_on_error true
      log_es_400_reason true
      include_tag_key true
      host "#{ENV['OUTPUT_HOST']}"
      port "#{ENV['OUTPUT_PORT']}"
      scheme "#{ENV['OUTPUT_SCHEME']}"
      ssl_version "#{ENV['OUTPUT_SSL_VERSION']}"
      user fluentd
      password "#{ENV['OUTPUT_PASSWORD']}"
      ssl_verify false
      logstash_format true
      logstash_prefix kube-namespaces
      <buffer>
        @type file
        flush_mode interval
        path /var/log/fluentd-buffers/kube-namespaces.buffer
        retry_type exponential_backoff
        flush_thread_count 2
        flush_interval 3s
        retry_limit 20
        retry_max_interval 30
        chunk_limit_size "#{ENV['OUTPUT_BUFFER_CHUNK_LIMIT']}"
        queue_limit_length "#{ENV['OUTPUT_BUFFER_QUEUE_LIMIT']}"
        overflow_action throw_exception
      </buffer>
    </match>

If I try to curl ES from another pod, it works fine. It feels like there is some underlying cause which isn't getting printed out. Any ideas what I could do here, @repeatedly?

I have tried:

Even though some of the suggested configs are default, I set them anyway. I'm going crazy trying to debug something that doesn't print enough usable logs. Any help is much appreciated!
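
For reference, both configs above set @log_level error on the elasticsearch outputs, which suppresses the warn-level "failed to flush the buffer" retry messages from those plugins. A minimal sketch of settings that tend to surface more detail when flushes hang; the threshold value is illustrative:

      # Sketch only: merge into the existing elasticsearch <match> sections.
      @log_level debug                 # error hides warn-level flush/retry messages
      slow_flush_log_threshold 15.0    # warn when one flush takes longer than this many seconds (default 20.0)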


sunyl527 commented Jun 1, 2020

I met the same issue. It seems that once the buffer file size grows beyond chunk_limit_size * queue_limit_length with overflow_action block, fluentd blocks. If I change block to drop_oldest_chunk, logs are flushed again, but I think that will lose some logs. Is there a better way to solve this issue?
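
For reference, with the v1 buffer configuration the total buffer budget can also be capped directly with total_limit_size instead of indirectly through queue_limit_length; overflow_action only decides what happens once that cap is hit. A minimal sketch with illustrative sizes and path:

      # Sketch only: sizes and path are illustrative.
      # block back-pressures the input; drop_oldest_chunk keeps flushing but discards the oldest data.
      <buffer>
        @type file
        path /var/log/fluentd-buffers/example.buffer
        chunk_limit_size 8MB
        total_limit_size 512MB
        overflow_action drop_oldest_chunk
      </buffer>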


This issue has been automatically marked as stale because it has been open for 90 days with no activity. Remove the stale label or comment, or this issue will be closed in 30 days.

github-actions bot added the stale label Dec 18, 2020

This issue was automatically closed because it remained stale for 30 days.
