Improve monitoring reporter #8090
Conversation
	Backoff Backoff `config:"backoff"`
}

type Backoff struct {
exported type Backoff should have comment or be unexported
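For illustration, a doc comment like the following would satisfy golint. This is a minimal sketch assuming Init/Max duration fields in the style of other libbeat backoff settings; only the type name comes from the diff, the field set is an assumption:

package report

import "time"

// Backoff configures how long the reporter waits before reconnect
// attempts after a failure. The wait starts at Init and grows
// exponentially up to Max.
// (Field names are illustrative assumptions, not the actual diff.)
type Backoff struct {
	Init time.Duration `config:"init"`
	Max  time.Duration `config:"max"`
}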
I would also like to get @ycombinator's comment on this one.
@urso Will need a rebase because of changelog fun.
@ruflin said:
Yes, this PR is on my TODO list. Just getting back from holiday; I plan to review it today.
LGTM, I've looked at the failover implementation and this looks like a good solution.
Should the
outClient := clients[0]
if len(clients) > 1 {
	outClient = outputs.NewFailoverClient(clients)
}
Why not just always initialize outClient as outputs.NewFailoverClient(clients)? Looking at the implementation of outputs.NewFailoverClient, it returns the first client from clients if len(clients) == 1 anyway:
beats/libbeat/outputs/failover.go
Lines 46 to 48 in e1d8c15
if len(clients) == 1 {
	return clients[0]
}
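For context, here is a rough sketch of the failover pattern under discussion, assuming a simplified Client interface. This is not the actual outputs.NewFailoverClient implementation, just an illustration of random re-selection on connect failure plus the single-client short-circuit quoted above:

package example

import (
	"errors"
	"math/rand"
)

// Client is a stand-in for the libbeat output client interface;
// the real interface has more methods.
type Client interface {
	Connect() error
}

// failoverClient is a hedged sketch, not the real libbeat type.
type failoverClient struct {
	clients []Client
	active  int // index of the current client, -1 if none yet
}

// newFailoverClient mirrors the short-circuit quoted above: with a
// single client there is nothing to fail over to.
func newFailoverClient(clients []Client) Client {
	if len(clients) == 1 {
		return clients[0]
	}
	return &failoverClient{clients: clients, active: -1}
}

// Connect picks a random client on first use; after a failure it
// picks a random client other than the one that just failed.
func (f *failoverClient) Connect() error {
	if len(f.clients) == 0 {
		return errors.New("no client configured")
	}
	if f.active < 0 {
		f.active = rand.Intn(len(f.clients))
	} else {
		// choose a random index, skipping the active one
		idx := rand.Intn(len(f.clients) - 1)
		if idx >= f.active {
			idx++
		}
		f.active = idx
	}
	return f.clients[f.active].Connect()
}

In this pattern the caller drives retries: when Connect fails, the caller waits (with backoff) and calls Connect again, which moves to another host.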
Force-pushed from e4f4103 to 7687d10 (commit message unchanged; see the PR description below).
@ruflin updated reference config files and docs. I noticed the
LGTM. WFG.
@urso Thanks for the update on the settings. No others were introduced as far as I remember.
Cherry-pick of PR #8090 (commit 43ee7d7) to the 6.4 branch in #8144; the backport carries the same message as the PR description below.
Closes: #7966
Add backoff and failover support to the Elasticsearch monitoring
reporter.
The monitoring reporter runs in two phases. In the first phase it checks
whether monitoring is enabled in Elasticsearch. The check runs every 30s.
If multiple hosts are configured, one host is selected at random.
Once phase 1 succeeds, phase 2 (the collection phase) is started.
Before this change, phase 2 was configured to use load balancing without
a timeout if multiple hosts are configured. With events being dropped on
error and only one document being generated every 10s, this was OK in
most cases. Still, if one output is blocked, failover to another host
only happens after waiting for a long timeout, even if no error has
occurred yet. If the failover host has errors, it might end up in a
tight reconnect loop without any backoff behavior.
With recent changes in 6.4, Beats creates many more documents, which
was not taken into account in the original design. Because of this,
misbehaving monitoring outputs are much more likely.
=> Problems with the reporter:
1. Failover was not handled correctly.
2. Creating more than one event, plus potentially spurious errors, raises the need for backoff.
This change configures the clients to failover mode only. Whenever the
connection to one host fails, another host is selected at random.
On failure, the reporter's output backs off exponentially: if the second
client (after failover) also fails, the backoff wait times are doubled,
and so on.
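The doubling behavior described above can be sketched like this; a minimal illustration assuming init/max style settings, not the reporter's actual code:

package example

import "time"

// backoff implements the exponential wait described above: the wait
// starts at init, doubles on every consecutive failure, is capped at
// max, and resets once an attempt succeeds.
type backoff struct {
	init, max time.Duration
	wait      time.Duration
}

func newBackoff(init, max time.Duration) *backoff {
	return &backoff{init: init, max: max, wait: init}
}

// Wait sleeps for the current duration, then doubles it for the next
// failure, capped at max.
func (b *backoff) Wait() {
	time.Sleep(b.wait)
	if b.wait *= 2; b.wait > b.max {
		b.wait = b.max
	}
}

// Reset restores the initial wait after a successful attempt.
func (b *backoff) Reset() {
	b.wait = b.init
}

// retry keeps attempting until success, backing off between failures.
func retry(attempt func() error, b *backoff) {
	for attempt() != nil {
		b.Wait()
	}
	b.Reset()
}

Combined with the failover selection above, each failed attempt both moves to another random host and lengthens the wait, which avoids the tight reconnect loop the PR describes.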