Consistency in Elastic stack metricsets' code #8308

Merged
merged 30 commits into elastic:master on Oct 25, 2018
Conversation

@ycombinator ycombinator commented Sep 13, 2018

This PR makes error messages, reporting, and logging consistent across all metricsets for Elastic stack products, including the xpack monitoring code paths:

  • elasticsearch/ccr
  • elasticsearch/cluster_stats
  • elasticsearch/index
  • elasticsearch/index_recovery
  • elasticsearch/index_summary
  • elasticsearch/ml_job
  • elasticsearch/node
  • elasticsearch/node_stats
  • elasticsearch/pending_tasks
  • elasticsearch/shard
  • kibana/status
  • kibana/stats
  • logstash/node
  • logstash/node_stats

@ycombinator added the in progress, Metricbeat, needs_backport, v7.0.0-alpha1, and v6.5.0 labels on Sep 13, 2018
@@ -57,25 +56,35 @@ func New(base mb.BaseMetricSet) (mb.MetricSet, error) {
func (m *MetricSet) Fetch(r mb.ReporterV2) {
isMaster, err := elasticsearch.IsMaster(m.HTTP, m.HostData().SanitizedURI+clusterStatsPath)
if err != nil {
r.Error(fmt.Errorf("Error fetching master info: %s", err))
msg := errors.Wrap(err, "error determining if connected Elasticsearch node is master.")
Member
Error strings should not end with punctuation. https://github.com/golang/go/wiki/CodeReviewComments#error-strings
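For example, a minimal sketch using the github.com/pkg/errors package already used in this PR (the message text is just illustrative):

// lower-case, no trailing period, so the message composes cleanly when wrapped further up the stack
msg := errors.Wrap(err, "error determining if connected Elasticsearch node is master")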

return &MetricSet{
base,
http,
config.XPack,
log,
Member
You might as well inline the logp.NewLogger statement.
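i.e. roughly this (a sketch; field order follows the snippet above and ModuleName is assumed to be defined in the package):

return &MetricSet{
	base,
	http,
	config.XPack,
	logp.NewLogger(ModuleName), // inlined, no intermediate log variable
}, nil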

@@ -92,11 +92,15 @@ func (m *MetricSet) Fetch(r mb.ReporterV2) {
}

if m.MetricSet.XPack {
eventsMappingXPack(r, m, content)
err = eventsMappingXPack(r, m, content)
Member
This error is never reported. Is that intentional?

Contributor Author
Yeah, we don't want to report X-Pack errors because they would end up in the metricbeat-* index, not the .monitoring-* index which is where all the "success" documents go.

Contributor Author
Obviously this is something we want to change at some point, but for now we're just trying to achieve parity with existing documents in .monitoring-*.

} else {
err = eventsMapping(r, content)
if err != nil {
r.Error(err)
}
}

if err != nil {
m.Log.Error(err)
Member
Why are the reporting and logging handled separately?

Contributor Author
Mainly because we want to log x-pack and non-xpack errors but want to report only non-xpack errors. I agree, it's not very pleasant.

@@ -20,6 +20,11 @@ package index_recovery
import (
"encoding/json"

"github.com/elastic/beats/metricbeat/helper/elastic"
Member
This import should be grouped with the other github.com/elastic imports.
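i.e. something like this (a sketch; the mb import is shown only to illustrate the stdlib vs. github.com/elastic grouping):

import (
	"encoding/json"

	"github.com/elastic/beats/metricbeat/helper/elastic"
	"github.com/elastic/beats/metricbeat/mb"
)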

err = eventsMapping(r, *info, content)
if err != nil {
r.Error(err)
m.Log.Error(err)
Member
This seems like a common pattern. I wonder if it would make sense to have the Metricbeat framework log any error that is reported by a metricset.

Contributor Author
@ycombinator ycombinator Sep 13, 2018
I was wondering this too but I have a fairly narrow and relatively recent view in Metricbeat. I'll defer to folks who've worked more broadly in it than me for their opinion: @jsoriano @ruflin, your thoughts?

Contributor
It would definitely be nice to make this more generic.

Best in a follow-up PR.
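One possible shape for that follow-up, sketched as a hypothetical decorator around Metricbeat's mb.ReporterV2 (not something this PR adds):

// loggingReporter logs every error a metricset reports, so metricsets
// no longer have to call both r.Error and m.Log.Error themselves.
type loggingReporter struct {
	mb.ReporterV2
	log *logp.Logger
}

func (r loggingReporter) Error(err error) bool {
	r.log.Error(err)               // log once, centrally
	return r.ReporterV2.Error(err) // then report as before
}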

return
}

if m.MetricSet.XPack {
eventMappingXPack(r, m, content)
err = eventMappingXPack(r, m, content)
Member
This error is never reported. Is that intentional?

Contributor
Yes, the xpack data structure cannot make sense of the errors.

@@ -32,6 +32,9 @@ import (
// Global clusterIdCache. Assumption is that the same node id never can belong to a different cluster id
var clusterIDCache = map[string]string{}

// ModuleName is the ame of this module
Member
s/ame/name/

And godocs should have proper punctuation. Wouldn't it be nicer if the descriptions here ended in punctuation? 😄

@@ -58,16 +59,14 @@ func eventsMapping(r mb.ReporterV2, info elasticsearch.Info, content []byte) err

err := json.Unmarshal(content, &indicesStruct)
if err != nil {
r.Error(err)
return err
return errors.Wrap(err, "failure parsing Elasticsearch Stats API response")
}

var errors multierror.Errors
for name, index := range indicesStruct.Indices {
event := mb.Event{}
event.MetricSetFields, err = schema.Apply(index)
Member
The event can also contain an error. I think it would make sense to store/report the error alongside the data that it's associated with.

Contributor
I think this is now solved with the more recent commits?

@ycombinator (Contributor, Author)

Good stuff, @andrewkroh. Appreciate the early feedback so I don't have to make the same fixes in a number of places later 😄!

@ycombinator ycombinator changed the title Consistency in Elastic stack metricsets' error handling Consistency in Elastic stack metricsets' code Sep 18, 2018
@ycombinator (Contributor, Author)

CI is failing because of unrelated flaky test: #8208

ycombinator commented Sep 29, 2018

@ruflin @andrewkroh The CI intake job is failing with this error:

18:08:40 ./module/elasticsearch/elasticsearch_integration_test.go
18:08:40 Code differs from goimports' style ^
18:08:40 ../libbeat/scripts/Makefile:115: recipe for target 'check' failed

I checked the imports in elasticsearch_integration_test.go and I'm not seeing anything odd. I didn't even mess with the imports in that file in this PR 😄 Could you help me figure out what I'm missing? Thanks.

ruflin commented Oct 1, 2018

Can you try to run make fmt to see if you get a diff?

ycombinator added a commit that referenced this pull request Oct 1, 2018
…#8497)

This PR makes the Metricbeat `logstash` module code consistent with other Elastic stack modules by:

* using `FetchContent()` instead of `FetchJSON()` to fetch data from LS APIs before passing it to the event mapping function.
* adding unit tests.

Further consistency around event reporting and error handling will be achieved in #8308.
@ycombinator (Contributor, Author)

> Can you try to run make fmt to see if you get a diff?

@ruflin I did get a diff, but it was in a file completely unrelated to this PR or to the error shown by the intake job in CI. Still, I tried to commit that diff in da30d94, but it didn't help either.

ruflin commented Oct 1, 2018

Which go version do you have locally?

@ycombinator (Contributor, Author)

1.11 😬 Let me try downgrading to 1.10.

ruflin commented Oct 1, 2018

Yes, I learned today that 1.10 does different indentations.

@ycombinator added the review label and removed the in progress label on Oct 2, 2018
@ycombinator ycombinator requested a review from ruflin October 2, 2018 11:12
Contributor
@ruflin ruflin left a comment
Glad to see much more user-friendly error messages. Very nice to have everything on ReporterV2 now. Thanks for doing all this cleanup. Left some minor comments.

The discussion around error reporting is more general for modules / Metricbeat, not only the stack modules. We can also delay it.

return
}

if m.MetricSet.XPack {
eventMappingXPack(r, m, content)
err = eventMappingXPack(r, m, content)
Contributor
Yes, the xpack data structure cannot make sense of the errors.

for name, index := range indicesStruct.Indices {
event := mb.Event{}
event.MetricSetFields, err = schema.Apply(index)
if err != nil {
r.Error(err)
errors = append(errors, err)
errs = append(errs, errors.Wrap(err, "failure applying index schema"))
Contributor
I think we should keep the old behaviour where we have 1 error per index. +1 on what Andrew suggested above.

Contributor Author
+1 to reporting 1 error per event, but only in the non-xpack code paths (since reporting errors into xpack indices doesn't make sense, at least today). I will fix all non-xpack code paths in this PR to go back to reporting 1 error per event.
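Roughly, the per-event variant for the index loop above would look like this (a sketch; the wrap message is illustrative):

var errs multierror.Errors
for name, index := range indicesStruct.Indices {
	event := mb.Event{}
	fields, err := schema.Apply(index)
	if err != nil {
		// Attach the error to the event it belongs to and still report the
		// event, so one bad index doesn't suppress the others.
		event.Error = errors.Wrapf(err, "failure applying index schema for index %v", name)
		errs = append(errs, event.Error)
	} else {
		event.MetricSetFields = fields
	}
	r.Event(event)
}
return errs.Err()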

eventsMappingXPack(r, m, *info, content)
err = eventsMappingXPack(r, m, *info, content)
if err != nil {
m.Log.Error(err)
Contributor
This should make debugging easier. I wonder if we should really do it at the Error level or only at Info or Debug. The reason is that I assume an Error is also logged when Elasticsearch is down, but I would argue that is not an error in the sense of the Error log.

Contributor Author
I definitely don't think Info, because then it would get logged all the time? I can see the argument both ways for Error vs. Debug. I'd rather keep it at Error because we only log it if something has gone wrong. Whereas Debug logging tends to happen regardless, just to provide additional detail.

Contributor
Why would info get logged all the time?

Contributor Author
I think I might've misunderstood. When you suggested using Info, I thought you meant something like this:

err = eventsMappingXPack(r, m, *info, content)
m.Log.Info(err)

Notice, no if check.

But I guess we could still log at Info level while keeping the if check. Still, feels a bit odd to me that we'd log errors at Info level. err could contain "fake" errors like ES being down but err could also contain a real error like a parsing error that happened inside eventsMappingXPack.

@@ -38,23 +41,23 @@ func eventsMappingXPack(r mb.ReporterV2, m *MetricSet, content []byte) error {
for indexName, indexData := range data {
indexData, ok := indexData.(map[string]interface{})
if !ok {
continue
return fmt.Errorf("%v is not a map", indexName)
Contributor
This is a change in behaviour. Should we really skip all metrics if one is wrong? Same for the things below.

It is probably unlikely that one is wrong and the others are not?

Contributor Author
My thinking with the new approach is:

  1. It's unlikely that one is wrong and others are not, and also
  2. In the case when some data might be erroneous, is it better to make a best effort or is it better to fail fast? I decided to go with the latter because I think it makes it easier to spot an issue.

Contributor
The other way of looking at it: what if the data inside one index object is weird and all the others are OK? It would mean we don't ship all the valid index data. Perhaps the solution is somewhere in between: we send just 1 error doc but still send the valid ones (I think you did that further down).

Contributor
Let's move forward with the way it is at the moment and fix it if it breaks.

}

nodeInfo, err := elasticsearch.GetNodeInfo(m.HTTP, m.HostData().SanitizedURI+statsPath, "")
Contributor
Was this code obsolete, or is it the one that will be fixed in another PR?

Contributor Author
It's obsolete because the monitoring UI does not need source_node for indices_stats documents.

@@ -58,11 +60,12 @@ func eventsMapping(r mb.ReporterV2, content []byte) error {

event.MetricSetFields, err = schema.Apply(job)
if err != nil {
errs = append(errs, err)
errs = append(errs, errors.Wrap(err, "failure applying ml job schema"))
Contributor
This is a change in behaviour as in case of an error, still an incomplete event was sent. We should discuss what our general expectation in modules here are.

@@ -120,7 +121,8 @@ func eventsMapping(r mb.ReporterV2, content []byte) error {

event.MetricSetFields, err = schema.Apply(node)
if err != nil {
r.Error(err)
errs = append(errs, errors.Wrap(err, "failure to apply node schema"))
Contributor
Change in behaviour. Instead of 5 errors now only 1 overall error is sent. We should discuss what our expectation here is.

}

dataFields, err := schema.Apply(data)
if err != nil {
r.Error(err)
return errors.Wrap(err, "failure to apply stats schema")
Contributor
Ties into the discussion above on what our expectation on error is.

Contributor Author
Yep, same as https://github.com/elastic/beats/pull/8308/files#r222085637. For non-xpack code paths (like this one here) I will update the PR to report one error per event.

)

func TestFetch(t *testing.T) {
Contributor
I assume this is removed because of the reporter change. But I was not sure where this is covered now?

Contributor Author
Yes, I removed this because Fetch no longer returns an event. And now this file is consistent with other Elastic stack integration test files; they also only have a TestData function that ultimately exercises Fetch via the mbtest.WriteEventsReporterV2 call in the test code.

Contributor
TestData is not run by default in our tests. What I did for the Elasticsearch module you can find here: https://github.com/elastic/beats/blob/master/metricbeat/module/elasticsearch/elasticsearch_integration_test.go It basically imports all the metricsets with a fetcher and then runs the tests automatically. We should do this here too for all the ones where TestFetch was removed.
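For reference, a rough sketch of that pattern for the metricsets whose TestFetch was removed (the host and metricset list are placeholders; the mbtest helpers are the ones Metricbeat integration tests already use):

// +build integration

package elasticsearch_test

import (
	"testing"

	mbtest "github.com/elastic/beats/metricbeat/mb/testing"

	// Importing the metricsets registers them with the module.
	_ "github.com/elastic/beats/metricbeat/module/elasticsearch/ccr"
	_ "github.com/elastic/beats/metricbeat/module/elasticsearch/node_stats"
)

var metricSets = []string{"ccr", "node_stats"} // extend with the remaining metricsets

func TestFetch(t *testing.T) {
	for _, metricSet := range metricSets {
		t.Run(metricSet, func(t *testing.T) {
			config := map[string]interface{}{
				"module":     "elasticsearch",
				"metricsets": []string{metricSet},
				"hosts":      []string{"localhost:9200"}, // placeholder test host
			}
			f := mbtest.NewReportingMetricSetV2(t, config)
			events, errs := mbtest.ReportingFetchV2(f)
			if len(errs) > 0 {
				t.Fatalf("fetch errors: %v", errs)
			}
			if len(events) == 0 {
				t.Fatal("expected at least one event")
			}
		})
	}
}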

ycombinator commented Oct 2, 2018

@ruflin @andrewkroh I've attempted to address your review feedback around error reporting with events by making changes just to the elasticsearch/ccr metricset in 928d27b. Please take a look at this commit and let me know what you think. Once we agree on the pattern with this metricset, I'll apply the same pattern to other metricsets in this PR as well.

Specifically, I've tried to establish the following pattern in this sample commit:

  1. For non-x-pack event parsing, report errors via the reporter v2 API. If the error is applicable to a single event, report it via the event's Error field.
  2. For x-pack and non-x-pack event parsing alike, log errors. (A rough sketch of the combined pattern follows below.)
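Sketched against a ccr-style Fetch body (a sketch rather than a quote of 928d27b; helper and field names follow the diffs above):

func (m *MetricSet) Fetch(r mb.ReporterV2) {
	content, err := m.HTTP.FetchContent()
	if err != nil {
		r.Error(err)
		m.Log.Error(err)
		return
	}

	if m.MetricSet.XPack {
		// Rule 2 only: X-Pack errors are logged, not reported, so no error
		// documents end up in metricbeat-* instead of .monitoring-*.
		err = eventsMappingXPack(r, m, content)
	} else {
		// Rule 1: non-X-Pack errors are reported as well as logged; per-event
		// problems are attached to each event's Error field inside eventsMapping.
		err = eventsMapping(r, content)
		if err != nil {
			r.Error(err)
		}
	}

	if err != nil {
		m.Log.Error(err) // rule 2: everything is logged
	}
}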

ruflin commented Oct 3, 2018

Normally I would ask you to split up this PR into multiple PRs. I think there are lots of parts that need no discussion and could be merged directly, like the additional log messages you added or introducing the logger. That way, a big part of the changes could already be merged, and the discussion would only be holding back a small part of the PR. At the same time, I see it would be quite a bit of work to extract the non-controversial parts now.

ycombinator added a commit that referenced this pull request Oct 3, 2018
…#8497) (#8525)

This PR makes the Metricbeat `logstash` module code consistent with other Elastic stack modules by:

* using `FetchContent()` instead of `FetchJSON()` to fetch data from LS APIs before passing it to the event mapping function.
* adding unit tests.

Further consistency around event reporting and error handling will be achieved in #8308.

(cherry picked from commit 909d789)
@ycombinator (Contributor, Author)

@ruflin Initially this PR started out small, but then it grew because at the time of development I wasn't sure how far these changes would cascade. But now that all metricsets are done and we've identified the parts we like (e.g. additional log messages + logger), I'm happy to extract those parts into a separate PR to keep things moving. That will also make the controversial parts easier to discuss.

So I'm going to keep this PR open for the controversial parts, since there is discussion history building up here. But I will put up separate PR(s) that will only handle the better log messages and the logger. Let's get those "easy" PR(s) merged first, then come back to this PR (after rebasing it, etc.) for further discussion?

@ycombinator (Contributor, Author)

Per my previous comment, I've teased out just the error message consistency improvements into their own PR: #8551.

@ycombinator added the in progress label and removed the review label on Oct 4, 2018
@ycombinator (Contributor, Author)

@ruflin Other PR (#8592) is merged into master and this PR has been rebased on master now. Ready for your 👀 again.

Contributor
@ruflin ruflin left a comment
I like the error logging / reporting model we have now in the Elasticsearch and Kibana modules. We should probably make that our standard when changing other modules.

event.MetricSetFields, err = schema.Apply(shard)
if err != nil {
errors = append(errors, err)
event.Error = errors.Wrap(err, "failure applying shard schema")
Contributor
I like that we report partial document events even on failure. We should probably document this somewhere.

@ycombinator ycombinator merged commit 3b2399d into elastic:master Oct 25, 2018
@ycombinator ycombinator deleted the metricbeat-xpack-log-errors branch October 25, 2018 21:23
@ycombinator added the v6.6.0 label and removed the needs_backport label on Oct 25, 2018
ycombinator added a commit that referenced this pull request Nov 5, 2018
…de (#8755)

Cherry-pick of PR #8308 to 6.x branch. Original message: 

This PR makes error messages, reporting, and logging consistent across all metricsets for Elastic stack products, including the xpack monitoring code paths:

* [x] `elasticsearch/ccr`
* [x] `elasticsearch/cluster_stats`
* [x] `elasticsearch/index`
* [x] `elasticsearch/index_recovery`
* [x] `elasticsearch/index_summary`
* [x] `elasticsearch/ml_job`
* [x] `elasticsearch/node`
* [x] `elasticsearch/node_stats`
* [x] `elasticsearch/pending_tasks`
* [x] `elasticsearch/shard`
* [x] `kibana/status`
* [x] `kibana/stats`
* [x] `logstash/node`
* [x] `logstash/node_stats`
ycombinator added a commit that referenced this pull request Dec 19, 2018
The cleanup in #8308 introduced a regression which caused the `xpack.enabled` flag in the `kibana` module to stop having any effect.

This PR fixes that issue by reverting the troublesome bits from #8308, ensuring that the struct field for the parsed configuration setting is exported.
ycombinator added a commit that referenced this pull request Dec 19, 2018
Cherry-pick of PR #9642 to 6.x branch. Original message: 

The cleanup in #8308 introduced a regression which caused the `xpack.enabled` flag in the `kibana` module to stop having any effect.

This PR fixes that issue by reverting the troublesome bits from #8308, ensuring that the struct field for the parsed configuration setting is exported.