
Improve monitoring reporter #8090

Merged
urso merged 6 commits from the monitoring-es-backoff branch into elastic:master on Aug 29, 2018
Conversation

@urso commented Aug 26, 2018

Closes: #7966

Add backoff and failover support to the Elasticsearch monitoring reporter.

The monitoring reporter runs in two phases. In the first phase it checks whether monitoring is enabled in Elasticsearch. The check runs every 30s, and if multiple hosts are configured, one host is selected at random. Once phase 1 succeeds, phase 2 (the collection phase) is started.

Before this change, phase 2 was configured to use load balancing without a timeout if multiple hosts were configured. With events being dropped on error and only one document being generated every 10s, this was OK in most cases. Still, if one output is blocked, failover to another host only happens after waiting for a long timeout, even if no error has occurred yet. And if the failover host has errors, the reporter might end up in a tight reconnect loop without any backoff behavior. With the recent changes in 6.4, Beats creates many more documents, which was not taken into account in the original design. Because of this, misbehaving monitoring outputs are much more likely.

Problems with the reporter:

  1. Failover was not handled correctly.
  2. Creating more than one event, plus potentially spurious errors, raises the need for backoff.

This change configures the clients to failover mode only. Whenever the connection to one host fails, another host is selected at random. On failure, the reporter's output backs off exponentially: if the second client (after failover) also fails, the backoff waiting times are doubled, and so on.
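
For illustration, here is a minimal sketch of the failover-plus-backoff loop described above. All names and the 1s/60s defaults are assumptions for the example, not the actual libbeat code:

package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

type host struct{ addr string }

// connect always fails in this sketch so the backoff doubling is visible.
func (h host) connect() error {
	return errors.New("connect failed: " + h.addr)
}

func main() {
	hosts := []host{{"es1:9200"}, {"es2:9200"}}
	initWait, maxWait := 1*time.Second, 60*time.Second // assumed defaults
	wait := initWait

	for attempt := 1; attempt <= 5; attempt++ {
		// Failover only: pick a host at random for each (re)connect.
		h := hosts[rand.Intn(len(hosts))]
		if err := h.connect(); err == nil {
			wait = initWait // reset the backoff after a success
			return
		}
		fmt.Printf("attempt %d on %s failed, waiting %v\n", attempt, h.addr, wait)
		time.Sleep(wait)
		wait *= 2 // exponential backoff: double the wait on repeated failure
		if wait > maxWait {
			wait = maxWait
		}
	}
}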

Backoff Backoff `config:"backoff"`
}

type Backoff struct {


exported type Backoff should have comment or be unexported
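
For context, a hedged completion of the snippet above, including the doc comment the linter asks for. The package name, field names, and time.Duration types are assumptions based on this discussion, not the exact diff:

package elasticsearch

import "time"

// Backoff holds the reporter's exponential backoff settings.
type Backoff struct {
	Init time.Duration `config:"init"` // initial wait after a failure (assumed field)
	Max  time.Duration `config:"max"`  // cap for the doubled wait times (assumed field)
}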

@urso requested review from ph and ycombinator on August 26, 2018
@ruflin (Contributor) left a comment

I would also like to get @ycombinator's comment on this one.

@ruflin (Contributor) commented Aug 27, 2018

@urso Will need a rebase because of changelog fun.

@ycombinator (Contributor) commented:

@ruflin said:

I would also like to get @ycombinator's comment on this one.

Yes, this PR is on my TODO list. Just getting back from holiday and plan to review it today.

@ph (Contributor) left a comment

LGTM. I've looked at the failover implementation, and this looks like a good solution.

@ycombinator (Contributor) commented Aug 27, 2018

Should the xpack.monitoring.elasticsearch config section mention the new backoff settings?

outClient := clients[0]
if len(clients) > 1 {
	outClient = outputs.NewFailoverClient(clients)
}

Why not just always initialize outClient as outputs.NewFailoverClient(clients)? Looking at the implementation of outputs.NewFailoverClient, it returns the first client from clients if len(clients) == 1 anyway:

if len(clients) == 1 {
	return clients[0]
}
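
For reference, a sketch of how such a random-failover wrapper can behave, including the quoted single-client short-circuit. This is illustrative only, not the actual outputs.NewFailoverClient implementation:

package outputs

import "math/rand"

// Client is a minimal stand-in for the output client interface.
type Client interface {
	Connect() error
}

type failoverClient struct {
	clients []Client
	active  int // index of the last client tried, -1 initially
}

// NewFailoverClient keeps the quoted short-circuit for a single client and
// otherwise wraps the list so a new host is picked at random on reconnect.
func NewFailoverClient(clients []Client) Client {
	if len(clients) == 1 {
		return clients[0]
	}
	return &failoverClient{clients: clients, active: -1}
}

func (f *failoverClient) Connect() error {
	// Choose a random client different from the one that just failed.
	next := rand.Intn(len(f.clients))
	for next == f.active {
		next = rand.Intn(len(f.clients))
	}
	f.active = next
	return f.clients[next].Connect()
}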

@ph added the libbeat label Aug 27, 2018
@ruflin added the needs_backport, v6.5.0, v6.4.1, in progress, and review labels Aug 28, 2018
urso added 6 commits August 28, 2018 23:39
@urso force-pushed the monitoring-es-backoff branch from e4f4103 to 7687d10 on August 28, 2018
@urso (Author) commented Aug 28, 2018

@ruflin Updated the reference config files and docs. I noticed the period settings in the docs were out of date and updated/added them as well. No idea if other settings in the docs are also out of date.

@urso removed the in progress label Aug 28, 2018
@ycombinator (Contributor) left a comment

LGTM. WFG.

@ruflin (Contributor) commented Aug 29, 2018

@urso Thanks for the update on the settings. No others were introduced as far as I remember.

@urso merged commit 43ee7d7 into elastic:master Aug 29, 2018
@urso added the v6.5.0 label and removed the needs_backport label Aug 29, 2018
urso pushed a commit to urso/beats that referenced this pull request on Aug 29, 2018 (cherry picked from commit 43ee7d7)
@urso added the v6.4.1 label Aug 29, 2018
urso pushed a commit to urso/beats that referenced this pull request on Aug 29, 2018 (cherry picked from commit 43ee7d7)
urso pushed a commit that referenced this pull request on Aug 30, 2018 (cherry picked from commit 43ee7d7)
urso pushed a commit that referenced this pull request on Aug 30, 2018: Cherry-pick of PR #8090 to 6.4 branch.
@urso deleted the monitoring-es-backoff branch on February 19, 2019
leweafan pushed a commit to leweafan/beats that referenced this pull request on Apr 28, 2023 (…#8144): Cherry-pick of PR elastic#8090 to 6.4 branch.