Mongo dbstats #3228
Conversation
Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run.
func (m *MetricSet) Fetch() ([]common.MapStr, error) {

	// establish connection to mongo
	session, err := mgo.DialWithInfo(m.dialInfo)
This follows the pattern used by the status metricset here, but I'm not sure it makes sense to open and close the mongo connection each time Fetch() is called. According to the [docs](https://www.elastic.co/guide/en/beats/metricbeat/current/metricset-details.html#_timeout_connections_to_services), connections should be established in the New() method and persist between fetches.
I'm going to begin implementing a pattern where the mongo session is persisted in each MetricSet instance to avoid constantly opening and closing connections. I would add that work to this existing PR, or would you prefer it be handled in a separate submission?
Lastly, based on this implementation I'm more confident that the mongodb module is not connecting directly to the individual instances, but would instead connect to the cluster. This means that any cluster member may answer the query and result in data inconsistencies. This issue is a bit more involved, and I don't know enough about the platform to be sure, so I may need to jump on a Zoom call or something to talk further.
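A minimal sketch of the session-persistence pattern described above, assuming the mgo driver (gopkg.in/mgo.v2); the field name mongoSession and the "test" database are illustrative, not necessarily what this PR ends up with:

```go
package dbstats

import (
	"github.com/elastic/beats/libbeat/common"
	"github.com/elastic/beats/metricbeat/mb"
	mgo "gopkg.in/mgo.v2"
)

// MetricSet keeps a single mongo session alive for the lifetime of the metricset.
type MetricSet struct {
	mb.BaseMetricSet
	mongoSession *mgo.Session
}

// New dials MongoDB once and stores the session on the MetricSet.
func New(base mb.BaseMetricSet) (mb.MetricSet, error) {
	dialInfo, err := mgo.ParseURL(base.HostData().URI)
	if err != nil {
		return nil, err
	}

	session, err := mgo.DialWithInfo(dialInfo)
	if err != nil {
		return nil, err
	}

	return &MetricSet{
		BaseMetricSet: base,
		mongoSession:  session,
	}, nil
}

// Fetch reuses the persisted session instead of dialing on every call.
func (m *MetricSet) Fetch() ([]common.MapStr, error) {
	result := map[string]interface{}{}
	// Run dbStats against an example database; mapping the raw result into
	// the metricset's schema is elided in this sketch.
	if err := m.mongoSession.DB("test").Run("dbStats", &result); err != nil {
		return nil, err
	}
	return []common.MapStr{common.MapStr(result)}, nil
}
```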
+1 on persisting the connection. It is probably best to extract the connection setup logic to the module level and just use it as a function.
Connecting to single instances versus the cluster can become tricky. We had a similar issue with Kafka. Our recommendation is to install Metricbeat on each edge node (some exceptions exist here), as it also gives you additional metrics about the system. I don't know the details of MongoDB clustering, but for Kafka we solved it the following way (generic description):
- Individual node stats are reported by each mongo instance
- Global stats about cluster / indices are only fetched from master
This requires that a Metricbeat instance can detect whether its node is the master or not. I also assume this must be dynamic, as the master can change over time.
I assume the dbStats response is the same for each node, so only the data from the master should be fetched.
Happy to jump on a Zoom call to go into more detail.
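A hedged sketch of the master/primary detection mentioned above, assuming the mgo driver (gopkg.in/mgo.v2); the helper name isMaster is illustrative and not part of this PR:

```go
package mongodb

import (
	mgo "gopkg.in/mgo.v2"
)

// isMaster reports whether the node behind this session currently believes it
// is the primary, using MongoDB's isMaster command. Cluster-wide stats could
// then be fetched only when this returns true.
func isMaster(session *mgo.Session) (bool, error) {
	var result struct {
		IsMaster bool `bson:"ismaster"`
	}
	if err := session.DB("admin").Run("isMaster", &result); err != nil {
		return false, err
	}
	return result.IsMaster, nil
}
```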
Connecting to single instances
Tricky indeed. Fortunately, mgo (the mongo client) has a dialInfo.Direct parameter. In mongobeat, I initiate a master connection to the cluster and then create individual node connections to each node. You can see it here; I create the individual connections in the "controller" here.
I've begun implementing something similar for the mongodb module, minus node discovery. This is implemented at the module level to share logic and prevent redundant connections.
Detecting master/primary
Easily implemented with mgo. You can specify read preference to always read from primary. If the primary changes, your reads will automatically be directed to the new primary.
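A minimal sketch of the two mgo features mentioned above (gopkg.in/mgo.v2): Direct pins the connection to a single node, while the Primary mode on a cluster session keeps reads on whichever node is currently primary. The URI and timeout are illustrative:

```go
package mongodb

import (
	"time"

	mgo "gopkg.in/mgo.v2"
)

func dialExamples() error {
	// Direct connection: talk only to the configured node, no topology discovery.
	info, err := mgo.ParseURL("mongodb://localhost:27017")
	if err != nil {
		return err
	}
	info.Direct = true
	info.Timeout = 5 * time.Second

	nodeSession, err := mgo.DialWithInfo(info)
	if err != nil {
		return err
	}
	defer nodeSession.Close()

	// Cluster connection with a primary read preference: if the primary
	// changes, reads are automatically redirected to the new primary.
	clusterSession, err := mgo.Dial("mongodb://localhost:27017")
	if err != nil {
		return err
	}
	defer clusterSession.Close()
	clusterSession.SetMode(mgo.Primary, true)

	return nil
}
```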
dbStats response the same
Technically, the dbStats response (and serverStatus) is not the same for each node. Sometimes the differences are trivial; other times they're not. I'll cover the cases where they're not the same:
In the case of a replica set, the state of the secondaries lag behind the primary. Thus, seeing the stats for each node helps monitor replication lag and success/failure. This is especially helpful in the case of multi-datacenter replication, where larger discrepancies are usually present.
In the case of a sharded cluster, the complexity is compounded because nodes contain different databases and collections, and a shard can also be a replica set! =o
Conclusion
Given these considerations, I'm still having a hard time deciding what the best implementation for Metricbeat would be. With the current implementation, the node that is read from is implicit and can change from period to period. For example, you might connect to mongo on localhost and expect that you're reading from that node, but under the hood mgo is actually discovering the other nodes in the cluster and directing reads to any one of the member nodes. Thus, I think it's possible to receive unexpected and inconsistent results.
If we enforce that Metricbeat only establishes a direct connection to a single mongo instance on localhost, then all of this complexity goes away and the output is very predictable. Yay! But it limits the flexibility of the system.
If we allow the agent to connect to mongo as a cluster, and permit configuration of multiple hosts, then I think there are a lot of different directions we could take.
Definitely a Zoom call will be helpful. I'm sure you can teach me a lot more about how metricbeat works under the hood, so I can understand the best direction.
I don't have your email, but can you contact me at [email protected] and we'll set up a time? Perhaps we could talk sometime Thursday afternoon (your time).
@scottcrespo Sorry, I was mostly offline the last two days, so I only saw the update now.
I always try to go with the simplest solution first, which I agree is the one that only connects to localhost. Because of Docker I would probably rephrase this to: the mongodb module only gets the stats from the node it connects to, as it does not necessarily always have to be localhost on Docker.
Could you update the PR so that it always only connects to the defined host (and not the cluster)?
About the Zoom call: let's see how far we can take it here in the PR, as I'm rarely online in the next days because of holidays. But I'm happy to answer any questions about Metricbeat in more detail here if needed.
No problem! I've been on vacation as well, but I have a few days to work on beats before taking some additional time off.
I'm in the process of updating the mongodb module to establish direct connections to the hosts listed in the config, and to persist those connections between calls. I should be able to update this PR within a few days!
… node from holding up the reporting
… fields get lost when calling eventMapping()
…additional fields to dbstats metricset
…connections Metricbeat mongodb persistent connections
Ready for your review. Please let me know what changes I might need to make!
Thanks for pushing this forward. Also finally back online again.
In general the changes LGTM. I left some minor comments / questions. The major logical change is that MongoDB now connects to a specific host instead of the cluster, correct?
@tsg Could you also have a look at the changes, since you did the mongo implementation as far as I remember?
@@ -117,7 +117,7 @@ metricbeat.modules:

#------------------------------- MongoDB Module ------------------------------
#- module: mongodb
-  #metricsets: ["status"]
+  #metricsets: ["status", "dbstats"]
We should probably sort these alphabetically. But I'm not sure if we are consistent with that in other modules ...
},
"mongodb": {
    "dbstats": {
        "example": "dbstats"
An example doc should be generated here with make generate-json. But that can still be done after merging the PR.
- name: objects
  type: long

- name: storage_size
What is the unit for all the _size fields?
- name: file_size
  type: long

- name: index_size
This should probably be index_size.bytes, and then you can add format: bytes to each block, so Kibana will automatically know that it is in bytes and can do some conversion.
"indexes": c.Int("indexes"), | ||
"index_size": c.Int("indexSize"), | ||
// mmapv1 only | ||
"ns_size_mb": c.Int("nsSizeMB"), |
Should this field be optional? Based on the comment, I assume it does not always exist.
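If the schema helpers in use don't offer an optional-field flag, a small hedged workaround could look like the sketch below; the names dropIfMissing, event, and the raw status map are illustrative, not from this PR.

```go
package dbstats

import "github.com/elastic/beats/libbeat/common"

// dropIfMissing removes a mapped field when the raw dbStats document does not
// contain the source key (e.g. nsSizeMB, which only exists with the mmapv1
// storage engine).
func dropIfMissing(event common.MapStr, raw map[string]interface{}, srcKey, dstKey string) {
	if _, found := raw[srcKey]; !found {
		delete(event, dstKey)
	}
}

// Example usage after mapping:
//   dropIfMissing(event, status, "nsSizeMB", "ns_size_mb")
```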
// Part of new is also setting up the configuration by processing additional
// configuration entries if needed.
func New(base mb.BaseMetricSet) (mb.MetricSet, error) {
	dialInfo, err := mgo.ParseURL(base.HostData().URI)
Could you add an experimental flag here? We normally introduce new metricsets as experimental. This allows us to first get some real world feedback for it and still change the data structure if needed without having to wait for a major release. Have a look here: https://github.com/elastic/beats/blob/master/metricbeat/module/haproxy/stat/stat.go#L34
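A hedged sketch of where such a warning could go, assuming the logp.Warn convention used by other Metricbeat metricsets at the time; the exact message format may differ:

```go
package dbstats

import (
	"github.com/elastic/beats/libbeat/logp"
	"github.com/elastic/beats/metricbeat/mb"
)

// New constructs the metricset; only the experimental warning is shown here,
// the rest of the constructor is elided.
func New(base mb.BaseMetricSet) (mb.MetricSet, error) {
	// Flag the metricset as experimental so users know its fields may still change.
	logp.Warn("EXPERIMENTAL: The mongodb dbstats metricset is experimental")

	// ... dial MongoDB and build the MetricSet as elsewhere in this PR ...
	return nil, nil
}
```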
nodeDialInfo.FailFast = true

logp.Info("Connecting to MongoDB node at %v", nodeDialInfo.Addrs)
fmt.Printf("Connecting to MongoDB node at %v", nodeDialInfo.Addrs)
The fmt should be removed here as we already log it.
if err != nil {
	return nil, err
}

return &MetricSet{
	BaseMetricSet: base,
	dialInfo:      dialInfo,
Do we still need to store the dialInfo now that we use mongoSession?
}
defer session.Close()

session.SetMode(mgo.Monotonic, true)
the monotonic mode was removed. Is this intentional?
Yes, this was intentional. Monotonic mode connects to the cluster as a whole and reads from the nearest secondary (it's not clear how "nearest" is defined). So, if your desired target is busy or goes down, mgo will read from a different node in the cluster. This failover strategy is what I was referring to earlier (and which I think is undesirable): you may read from a node other than the target specified in the module's configuration, which obfuscates node failures.
Thus, the change I've implemented establishes direct connections to each of the nodes specified in the module's settings. You can see how I did it here. This ensures that we are reading from the intended node. And, if it's slow or fails, we will know about it.
Also, note that I've set FailFast to true, which helps identify latent and non-responsive servers sooner rather than later.
@@ -17,43 +15,52 @@ TODOs:
 * add a metricset for "metrics" data
 */

-var debugf = logp.MakeDebug("mongodb.status")
I assume you removed this because it was not used. We normally keep it in to make it easy to add debug messages.
Thanks for all the feedback! I'm offline this week, but I'll implement your suggestions starting next week!
…csets in alphabetical order
Implemented all of your suggestions. Ready for you to take another look =)
jenkins, test it
@scottcrespo LGTM. Could you add an update to the CHANGELOG.asciidoc and squash the commits into 1?
jenkins, test it
@scottcrespo Merged. Thanks a lot for this contribution.
You're welcome! It's been great working with you on this. Thanks for all of the help and feedback.
@ruflin
Opening the PR early so we have a place to discuss implementation. At this point, I've copied the patterns used by the status metricset, but I have some questions and recommendations. Keep in mind, I'm still learning about the platform, so I might be making some incorrect assumptions!