Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fluentd v0.14: "Unexpected error raised. Stopping the timer." on heartbeat timeout #1596

Closed
notslang opened this issue Jun 13, 2017 · 8 comments · Fixed by #1602
Closed

fluentd v0.14: "Unexpected error raised. Stopping the timer." on heartbeat timeout #1596

notslang opened this issue Jun 13, 2017 · 8 comments · Fixed by #1602
Labels
moreinfo Missing version, need reproducible steps, need to investigate more

Comments

@notslang
Copy link

It seems like a timeout during a heartbeat caused the Unexpected error raised. Stopping the timer. part, but rather than trying to reconnect to 192.168.1.202, it just keeps failing with no nodes are available. I've checked that fluentd on 192.168.1.202 is running and even restarted it, but the fluentd instance that is forwarding logs to it refuses to reconnect.

2017-06-13 08:59:51 +0000 [warn]: #0 failed to flush the buffer. retry_time=3 next_retry_seconds=2017-06-13 08:59:51 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 08:59:51 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 08:59:58 +0000 [warn]: #0 failed to flush the buffer. retry_time=4 next_retry_seconds=2017-06-13 08:59:58 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 08:59:58 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 09:00:13 +0000 [warn]: #0 failed to flush the buffer. retry_time=5 next_retry_seconds=2017-06-13 09:00:13 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 09:00:13 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 09:00:48 +0000 [warn]: #0 failed to flush the buffer. retry_time=6 next_retry_seconds=2017-06-13 09:00:48 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 09:00:48 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 09:00:48 +0000 [error]: #0 Unexpected error raised. Stopping the timer. title=:out_forward_heartbeat_request error_class=Errno::ETIMEDOUT error="Operation timed out - connect(2) for \"192.168.1.202\" port 24224"
  2017-06-13 09:00:48 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin_helper/socket.rb:59:in `initialize'
  2017-06-13 09:00:48 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin_helper/socket.rb:59:in `new'
  2017-06-13 09:00:48 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin_helper/socket.rb:59:in `socket_create_tcp'
  2017-06-13 09:00:48 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:334:in `create_transfer_socket'
  2017-06-13 09:00:48 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:662:in `send_heartbeat'
  2017-06-13 09:00:48 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:404:in `block in on_timer'
  2017-06-13 09:00:48 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:397:in `each'
  2017-06-13 09:00:48 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin/out_forward.rb:397:in `on_timer'
  2017-06-13 09:00:48 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin_helper/timer.rb:77:in `on_timer'
  2017-06-13 09:00:48 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/cool.io-1.5.0/lib/cool.io/loop.rb:88:in `run_once'
  2017-06-13 09:00:48 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/cool.io-1.5.0/lib/cool.io/loop.rb:88:in `run'
  2017-06-13 09:00:48 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin_helper/event_loop.rb:84:in `block in start'
  2017-06-13 09:00:48 +0000 [error]: #0 /usr/lib/ruby/gems/2.3.0/gems/fluentd-0.14.17/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2017-06-13 09:00:48 +0000 [error]: #0 Timer detached. title=:out_forward_heartbeat_request
2017-06-13 09:01:55 +0000 [warn]: #0 failed to flush the buffer. retry_time=7 next_retry_seconds=2017-06-13 09:01:55 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 09:01:55 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 09:03:49 +0000 [warn]: #0 failed to flush the buffer. retry_time=8 next_retry_seconds=2017-06-13 09:03:49 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 09:03:49 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 09:08:02 +0000 [warn]: #0 failed to flush the buffer. retry_time=9 next_retry_seconds=2017-06-13 09:08:02 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 09:08:02 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 09:16:51 +0000 [warn]: #0 failed to flush the buffer. retry_time=10 next_retry_seconds=2017-06-13 09:16:51 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 09:16:51 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 09:26:55 +0000 [warn]: #0 failed to flush the buffer. retry_time=11 next_retry_seconds=2017-06-13 09:26:55 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 09:26:55 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 09:37:39 +0000 [warn]: #0 failed to flush the buffer. retry_time=12 next_retry_seconds=2017-06-13 09:37:39 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 09:37:39 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 09:47:19 +0000 [warn]: #0 failed to flush the buffer. retry_time=13 next_retry_seconds=2017-06-13 09:47:19 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 09:47:19 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 09:56:13 +0000 [warn]: #0 failed to flush the buffer. retry_time=14 next_retry_seconds=2017-06-13 09:56:13 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 09:56:13 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 10:05:13 +0000 [warn]: #0 failed to flush the buffer. retry_time=15 next_retry_seconds=2017-06-13 10:05:13 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 10:05:13 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 10:15:12 +0000 [warn]: #0 failed to flush the buffer. retry_time=16 next_retry_seconds=2017-06-13 10:15:12 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 10:15:12 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 10:26:08 +0000 [warn]: #0 failed to flush the buffer. retry_time=17 next_retry_seconds=2017-06-13 10:26:08 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 10:26:08 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 10:36:04 +0000 [warn]: #0 failed to flush the buffer. retry_time=18 next_retry_seconds=2017-06-13 10:36:04 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 10:36:04 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 10:46:12 +0000 [warn]: #0 failed to flush the buffer. retry_time=19 next_retry_seconds=2017-06-13 10:46:12 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 10:46:12 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 10:55:49 +0000 [warn]: #0 failed to flush the buffer. retry_time=20 next_retry_seconds=2017-06-13 10:55:49 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 10:55:49 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 11:04:36 +0000 [warn]: #0 failed to flush the buffer. retry_time=21 next_retry_seconds=2017-06-13 11:04:36 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 11:04:36 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 11:15:35 +0000 [warn]: #0 failed to flush the buffer. retry_time=22 next_retry_seconds=2017-06-13 11:15:35 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 11:15:35 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 11:25:17 +0000 [warn]: #0 failed to flush the buffer. retry_time=23 next_retry_seconds=2017-06-13 11:25:17 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 11:25:17 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 11:34:38 +0000 [warn]: #0 failed to flush the buffer. retry_time=24 next_retry_seconds=2017-06-13 11:34:38 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 11:34:38 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 11:45:09 +0000 [warn]: #0 failed to flush the buffer. retry_time=25 next_retry_seconds=2017-06-13 11:45:09 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 11:45:09 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 11:53:58 +0000 [warn]: #0 failed to flush the buffer. retry_time=26 next_retry_seconds=2017-06-13 11:53:58 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 11:53:58 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 12:03:13 +0000 [warn]: #0 failed to flush the buffer. retry_time=27 next_retry_seconds=2017-06-13 12:03:13 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 12:03:13 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 12:12:11 +0000 [warn]: #0 failed to flush the buffer. retry_time=28 next_retry_seconds=2017-06-13 12:12:11 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 12:12:11 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 12:21:38 +0000 [warn]: #0 failed to flush the buffer. retry_time=29 next_retry_seconds=2017-06-13 12:21:38 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
  2017-06-13 12:21:38 +0000 [warn]: #0 suppressed same stacktrace
2017-06-13 12:30:36 +0000 [warn]: #0 failed to flush the buffer. retry_time=30 next_retry_seconds=2017-06-13 12:30:36 +0000 chunk="551d1b477b6d9624b2c8ce4f7729f59b" error_class=Fluent::Plugin::ForwardOutput::NoNodesAvailable error="no nodes are available"
...

config of the forwarder:

<source>
  @type  forward
  @id    input1
  #@label @mainstream
  port  24224
</source>

<match **>
  @type forward
  send_timeout 60s
  recover_wait 10s
  heartbeat_interval 1s
  phi_threshold 16
  hard_timeout 60s
  require_ack_response true
  compress gzip

  <server>
    name coreos0
    host 192.168.1.202
    port 24224
    weight 60
  </server>

  <buffer>
    @type file
    path /fluentd/log/buffer
    retry_max_interval 10m
    retry_forever true
    compress gzip
  </buffer>
</match>
  • fluentd version: Docker image fluentd:v0.14 with image id 860efe232d88
  • Environment: Docker running on Arch
@repeatedly repeatedly added the moreinfo Missing version, need reproducible steps, need to investigate more label Jun 19, 2017
@repeatedly
Copy link
Member

Error said "Errno::ETIMEDOUT". You need to investigate why this error happen on your network environemnt.

@mpas
Copy link

mpas commented Jun 20, 2017

Experiencing the same issue, when in a HA setup the aggregator nodes are unavailable the error is raised and Timer is stopped.

2017-06-20 14:45:34 +0000 [warn]: #0 no response from node. regard it as unavailable. host="log-aggregator-1.dev.l" port=24224
2017-06-20 14:45:50 +0000 [warn]: #0 recovered forwarding server 'aggregator-1' host="log-aggregator-1.dev.l" port=24224
2017-06-20 15:26:48 +0000 [error]: #0 Unexpected error raised. Stopping the timer. title=:out_forward_heartbeat_request error_class=Errno::ETIMEDOUT error="Operation timed out - connect(2) for "10.0.2.206" port 24224"

@repeatedly
Copy link
Member

Hmm... one way is ignoring Errno::ETIMEDOUT in heartbeat timer.

diff --git a/lib/fluent/plugin/out_forward.rb b/lib/fluent/plugin/out_forward.rb
index 712447b..aa2b32c 100644
--- a/lib/fluent/plugin/out_forward.rb
+++ b/lib/fluent/plugin/out_forward.rb
@@ -402,7 +402,7 @@ module Fluent::Plugin
           log.trace "sending heartbeat", host: n.host, port: n.port, heartbeat_type: @heartbeat_type
           n.usock = @usock if @usock
           n.send_heartbeat
-        rescue Errno::EAGAIN, Errno::EWOULDBLOCK, Errno::EINTR, Errno::ECONNREFUSED
+        rescue Errno::EAGAIN, Errno::EWOULDBLOCK, Errno::EINTR, Errno::ECONNREFUSED, Errno::ETIMEDOUT
           log.debug "failed to send heartbeat packet", host: n.host, port: n.port, heartbeat_type: @heartbeat_type, error: $!
         end
       }

@notslang
Copy link
Author

notslang commented Jun 20, 2017 via email

@repeatedly
Copy link
Member

Patch: #1602

@repeatedly
Copy link
Member

Release v0.14.18 with the fix.

@notslang
Copy link
Author

Hmm... one way is ignoring Errno::ETIMEDOUT in heartbeat timer.

That sounds good, but should any network error cause the heartbeat to stop? Even if I unplug a machine from the network entirely, and end up with ENETUNREACH or EPIPE errors, I'd still like fluentd to keep trying to reconnect so I don't have to manually restart it when the machine comes back online.

Or, at least fluentd should exit with a non-zero exit code so Kubernetes / Docker can restart it for me.

@repeatedly
Copy link
Member

repeatedly commented Jun 24, 2017

That sounds good, but should any network error cause the heartbeat to stop?

No, see patch. Continue to work.

Or, at least fluentd should exit with a non-zero exit code

Fluentd's worker exit with 1 for unexpected error / 2 for unrecoverable error, e.g. configuration error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
moreinfo Missing version, need reproducible steps, need to investigate more
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants