
Improve monitoring reporter #8090

Merged
urso merged 6 commits from the monitoring-es-backoff branch into elastic:master on Aug 29, 2018
Conversation

@urso commented Aug 26, 2018

Closes: #7966

Add backoff and failover support to the Elasticsearch monitoring reporter.

The monitoring reporter runs in two phases. In the first phase it checks whether monitoring is enabled in Elasticsearch. The check runs every 30s, and if multiple hosts are configured, one host is selected at random. Once phase 1 succeeds, phase 2 (the collection phase) is started.

Before this change, phase 2 was configured to use load balancing without a timeout if multiple hosts were configured. With events being dropped on error and only one document being generated every 10s, this was OK in most cases. Still, if one output is blocked, failover to another host only happens after waiting for a long timeout, even if no error has occurred yet. And if the failover host has errors, the reporter might end up in a tight reconnect loop without any backoff behavior. With the recent changes in 6.4, Beats creates many more documents, which was not taken into account in the original design. Because of this, misbehaving monitoring outputs are much more likely.

Problems with the reporter:

  1. Failover was not handled correctly.
  2. Creating more than one event, plus potentially spurious errors, raises the need for backoff.

This change configures the clients to failover mode only. Whenever the connection to one host fails, another host is selected at random. On failure, the reporter's output backs off exponentially: if the second client (after failover) also fails, the backoff waiting times are doubled, and so on.
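
For illustration, here is a minimal sketch of the failover-plus-backoff loop described above. All names and the 1s/60s defaults are assumptions for the example, not the actual libbeat code:

package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

type host struct{ addr string }

// connect always fails in this sketch so the backoff doubling is visible.
func (h host) connect() error {
	return errors.New("connect failed: " + h.addr)
}

func main() {
	hosts := []host{{"es1:9200"}, {"es2:9200"}}
	initWait, maxWait := 1*time.Second, 60*time.Second // assumed defaults
	wait := initWait

	for attempt := 1; attempt <= 5; attempt++ {
		// Failover only: pick a host at random for each (re)connect.
		h := hosts[rand.Intn(len(hosts))]
		if err := h.connect(); err == nil {
			wait = initWait // reset the backoff after a success
			return
		}
		fmt.Printf("attempt %d on %s failed, waiting %v\n", attempt, h.addr, wait)
		time.Sleep(wait)
		wait *= 2 // exponential backoff: double the wait on repeated failure
		if wait > maxWait {
			wait = maxWait
		}
	}
}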

Backoff Backoff `config:"backoff"`
}

type Backoff struct {


exported type Backoff should have comment or be unexported
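
For context, a hedged completion of the snippet above, including the doc comment the linter asks for. The package name, field names, and time.Duration types are assumptions based on this discussion, not the exact diff:

package elasticsearch

import "time"

// Backoff holds the reporter's exponential backoff settings.
type Backoff struct {
	Init time.Duration `config:"init"` // initial wait after a failure (assumed field)
	Max  time.Duration `config:"max"`  // cap for the doubled wait times (assumed field)
}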

@urso requested review from ph and ycombinator on August 26, 2018
@ruflin (Contributor) left a comment

I would also like to get @ycombinator's comment on this one.

@ruflin (Contributor) commented Aug 27, 2018

@urso Will need a rebase because of changelog fun.

@ycombinator (Contributor) commented:

@ruflin said:

I would also like to get @ycombinator's comment on this one.

Yes, this PR is on my TODO list. Just getting back from holiday and plan to review it today.

@ph (Contributor) left a comment

LGTM. I've looked at the failover implementation, and this looks like a good solution.

@ycombinator (Contributor) commented Aug 27, 2018

Should the xpack.monitoring.elasticsearch config section mention the new backoff settings?

outClient := clients[0]
if len(clients) > 1 {
	outClient = outputs.NewFailoverClient(clients)
}

Why not just always initialize outClient as outputs.NewFailoverClient(clients)? Looking at the implementation of outputs.NewFailoverClient, it returns the first client from clients if len(clients) == 1 anyway:

if len(clients) == 1 {
	return clients[0]
}
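
For reference, a sketch of how such a random-failover wrapper can behave, including the quoted single-client short-circuit. This is illustrative only, not the actual outputs.NewFailoverClient implementation:

package outputs

import "math/rand"

// Client is a minimal stand-in for the output client interface.
type Client interface {
	Connect() error
}

type failoverClient struct {
	clients []Client
	active  int // index of the last client tried, -1 initially
}

// NewFailoverClient keeps the quoted short-circuit for a single client and
// otherwise wraps the list so a new host is picked at random on reconnect.
func NewFailoverClient(clients []Client) Client {
	if len(clients) == 1 {
		return clients[0]
	}
	return &failoverClient{clients: clients, active: -1}
}

func (f *failoverClient) Connect() error {
	// Choose a random client different from the one that just failed.
	next := rand.Intn(len(f.clients))
	for next == f.active {
		next = rand.Intn(len(f.clients))
	}
	f.active = next
	return f.clients[next].Connect()
}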

@ph added the libbeat label Aug 27, 2018
@ruflin added the needs_backport, v6.5.0, v6.4.1, in progress, and review labels Aug 28, 2018
urso added 6 commits August 28, 2018 23:39
@urso force-pushed the monitoring-es-backoff branch from e4f4103 to 7687d10 on August 28, 2018
@urso (Author) commented Aug 28, 2018

@ruflin Updated the reference config files and docs. I noticed the period settings in the docs were out of date and updated/added them as well. No idea if other settings in the docs are also out of date.

@urso removed the in progress label Aug 28, 2018
@ycombinator (Contributor) left a comment

LGTM. WFG.

@ruflin (Contributor) commented Aug 29, 2018

@urso Thanks for the update on the settings. No others were introduced as far as I remember.

@urso merged commit 43ee7d7 into elastic:master Aug 29, 2018
@urso added the v6.5.0 label and removed the needs_backport label Aug 29, 2018
urso pushed a commit to urso/beats that referenced this pull request on Aug 29, 2018 (cherry picked from commit 43ee7d7)
@urso added the v6.4.1 label Aug 29, 2018
urso pushed a commit to urso/beats that referenced this pull request on Aug 29, 2018 (cherry picked from commit 43ee7d7)
urso pushed a commit that referenced this pull request on Aug 30, 2018 (cherry picked from commit 43ee7d7)
urso pushed a commit that referenced this pull request on Aug 30, 2018: Cherry-pick of PR #8090 to 6.4 branch.
@urso deleted the monitoring-es-backoff branch on February 19, 2019
leweafan pushed a commit to leweafan/beats that referenced this pull request on Apr 28, 2023 (…#8144): Cherry-pick of PR elastic#8090 to 6.4 branch.