Refactoring prometheus module to aggregate metrics based on metric family #4075

Merged: 2 commits merged into elastic:master from the prometheus_refactor branch on Apr 21, 2017

vjsamuel
Contributor

The current Prometheus collector implementation emits each metric as a separate event. This does not match how Prometheus models its data: Prometheus groups related metrics into a MetricFamily.

Example of a metric family would be:

# TYPE apiserver_request_latencies histogram
apiserver_request_latencies_bucket{resource="accounts",verb="GET",le="125000"} 11542
apiserver_request_latencies_bucket{resource="accounts",verb="GET",le="250000"} 11543
apiserver_request_latencies_bucket{resource="accounts",verb="GET",le="500000"} 11543
apiserver_request_latencies_bucket{resource="accounts",verb="GET",le="1e+06"} 11543
apiserver_request_latencies_bucket{resource="accounts",verb="GET",le="2e+06"} 11543
apiserver_request_latencies_bucket{resource="accounts",verb="GET",le="4e+06"} 11543
apiserver_request_latencies_bucket{resource="accounts",verb="GET",le="8e+06"} 11543
apiserver_request_latencies_bucket{resource="accounts",verb="GET",le="+Inf"} 11543
apiserver_request_latencies_sum{resource="accounts",verb="GET"} 1.7285094e+07
apiserver_request_latencies_count{resource="accounts",verb="GET"} 11543
apiserver_request_latencies_bucket{resource="accounts",verb="LIST",le="125000"} 566
apiserver_request_latencies_bucket{resource="accounts",verb="LIST",le="250000"} 567
apiserver_request_latencies_bucket{resource="accounts",verb="LIST",le="500000"} 567
apiserver_request_latencies_bucket{resource="accounts",verb="LIST",le="1e+06"} 567
apiserver_request_latencies_bucket{resource="accounts",verb="LIST",le="2e+06"} 567
apiserver_request_latencies_bucket{resource="accounts",verb="LIST",le="4e+06"} 567
apiserver_request_latencies_bucket{resource="accounts",verb="LIST",le="8e+06"} 567
apiserver_request_latencies_bucket{resource="accounts",verb="LIST",le="+Inf"} 567
apiserver_request_latencies_sum{resource="accounts",verb="LIST"} 4.695524e+06
apiserver_request_latencies_count{resource="accounts",verb="LIST"} 567

This metric family can be broken down as:
metric_name: apiserver_request_latencies
metric_type: histogram

It has two different label combinations:
{resource="accounts",verb="LIST"} and {resource="accounts",verb="GET"}

Hence two events ought to be created, one of which would look similar to:

      "apiserver_request_latencies": {
        "buckets": {
          "+Inf": 3696,
          "1000000": 0,
          "125000": 0,
          "2000000": 0,
          "250000": 0,
          "4000000": 0,
          "500000": 0,
          "8000000": 0
        },
        "count": 3696,
        "sum": 1668259775362.000000
      },
      "labels": {
        "resource": "roles",
        "verb": "WATCHLIST"
      }
    }
  }

This lets the values of a histogram or summary be viewed adjacent to each other in a single event.
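The grouping described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `sample` type and the `labelKey` and `groupByLabels` helpers are hypothetical names, and the sketch assumes the `le` label has already been turned into a field name.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sample is a hypothetical, simplified stand-in for one parsed metric line.
type sample struct {
	field  string            // "count", "sum", or a bucket upper bound
	labels map[string]string // label set, without the "le" label
	value  float64
}

// labelKey builds a deterministic key for a label set by sorting label names.
func labelKey(labels map[string]string) string {
	names := make([]string, 0, len(labels))
	for n := range labels {
		names = append(names, n)
	}
	sort.Strings(names)
	parts := make([]string, 0, len(names))
	for _, n := range names {
		parts = append(parts, n+"="+labels[n])
	}
	return strings.Join(parts, ",")
}

// groupByLabels returns one event (field -> value) per distinct label set.
func groupByLabels(samples []sample) map[string]map[string]float64 {
	events := map[string]map[string]float64{}
	for _, s := range samples {
		k := labelKey(s.labels)
		if events[k] == nil {
			events[k] = map[string]float64{}
		}
		events[k][s.field] = s.value
	}
	return events
}

func main() {
	get := map[string]string{"resource": "accounts", "verb": "GET"}
	list := map[string]string{"resource": "accounts", "verb": "LIST"}
	events := groupByLabels([]sample{
		{field: "count", labels: get, value: 11543},
		{field: "sum", labels: get, value: 1.7285094e+07},
		{field: "count", labels: list, value: 567},
		{field: "sum", labels: list, value: 4.695524e+06},
	})
	// Two label combinations, so two events.
	fmt.Println(len(events))
}
```

Sorting the label names before joining them keeps the key stable regardless of map iteration order, which is why the same label set always lands in the same event.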

@vjsamuel vjsamuel force-pushed the prometheus_refactor branch from 7a9e535 to 974a687 Compare April 21, 2017 00:49
@elasticmachine
Collaborator

Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run.

Member

@ruflin left a comment

This is a really nice addition. It will greatly improve the collector metricset and produce far fewer events. Using the Prometheus client also gives us some assurance that it keeps working.

@@ -78,6 +78,8 @@ https://github.com/elastic/beats/compare/v5.1.1...master[Check the HEAD diff]
- Make system process metricset honor the cpu_ticks config option. {issue}3590[3590]
- Support common.Time in mapstriface.toTime() {pull}3812[3812]
- Fixing panic on prometheus collector when label has , {pull}3947[3947]
- Fixing prometheus collector to aggregate metrics based on metric family. {pull}4075[4075]
Member

I would not put this under bugfixes but under breaking changes, as it changes the data structure.

@@ -6,6 +6,9 @@ import (
"github.com/elastic/beats/metricbeat/helper"
"github.com/elastic/beats/metricbeat/mb"
"github.com/elastic/beats/metricbeat/mb/parse"

"fmt"
Member

Could you move the fmt import to the top so that the standard imports are together?

}

eventList[promEvent.labelHash][promEvent.key] = promEvent.value
Member

I assume the promEvent.key values are fairly constant over time, so we don't get a field explosion here.

Contributor Author

+1

labels: common.MapStr{
"handler": "query",
"quantile": 0.99,
key: "http_request_duration_microseconds",
Member

In a future step we could also extract the "units" part like microseconds as I assume this is also part of the convention.
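One way such a unit extraction might look is sketched below. This is purely illustrative and not part of this PR: `splitUnit` and the `knownUnits` set are hypothetical, and the list of recognized suffixes is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// knownUnits is an assumed, non-exhaustive set of unit suffixes.
var knownUnits = map[string]bool{
	"seconds":      true,
	"milliseconds": true,
	"microseconds": true,
	"bytes":        true,
}

// splitUnit splits a metric name into its base name and unit suffix.
// If the last "_"-separated part is not a known unit, unit is empty
// and the name is returned unchanged.
func splitUnit(name string) (base, unit string) {
	i := strings.LastIndex(name, "_")
	if i < 0 {
		return name, ""
	}
	if suffix := name[i+1:]; knownUnits[suffix] {
		return name[:i], suffix
	}
	return name, ""
}

func main() {
	base, unit := splitUnit("http_request_duration_microseconds")
	fmt.Println(base, unit)
}
```

Names whose last part is not in the set are left untouched, so a metric without a unit suffix keeps its full name.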

"github.com/elastic/beats/libbeat/common"
dto "github.com/prometheus/client_model/go"
"math"
Member

Can you move these two to the top?

Member

Our general ordering for imports is:

- standard imports
- beats imports
- external imports

for _, metric := range metrics {
event := PromEvent{
key: name,
labelHash: "#",
Member

How exactly does that work?

Contributor Author

This makes sure that all metrics that don't have any tag values get grouped into a single document. It's a carry-over from the previous implementation that was there before this change.
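The effect of the "#" placeholder can be sketched as follows. The `promEvent` struct here only mirrors the three fields visible in the diff (`key`, `labelHash`, `value`) and is a simplified stand-in; the `merge` helper and the example hash string for the labelled metric are hypothetical.

```go
package main

import "fmt"

// promEvent is a simplified stand-in with the fields visible in the diff.
type promEvent struct {
	key       string
	labelHash string
	value     float64
}

// noLabels is the placeholder hash for metrics without any labels.
const noLabels = "#"

// merge collects events into one document per label hash,
// mirroring eventList[labelHash][key] = value.
func merge(events []promEvent) map[string]map[string]float64 {
	docs := map[string]map[string]float64{}
	for _, e := range events {
		if docs[e.labelHash] == nil {
			docs[e.labelHash] = map[string]float64{}
		}
		docs[e.labelHash][e.key] = e.value
	}
	return docs
}

func main() {
	docs := merge([]promEvent{
		{key: "go_goroutines", labelHash: noLabels, value: 42},
		{key: "process_open_fds", labelHash: noLabels, value: 17},
		// Hypothetical hash for a metric that does carry labels.
		{key: "http_requests_total", labelHash: "handler=query", value: 3},
	})
	// Both unlabelled metrics share the "#" document.
	fmt.Println(len(docs), len(docs[noLabels]))
}
```

Because every unlabelled metric starts with the same hash, they all end up as fields of one document instead of one document each.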

key := strconv.FormatFloat((100 * quantile.GetQuantile()), 'f', -1, 64)

if math.IsNaN(quantile.GetValue()) == false {
percentileMap["p"+key] = quantile.GetValue()
Member

As it is already under percentile, I don't think we need to add a p in front of the key.

Contributor Author

done

}

if len(percentileMap) != 0 {
value["percentiles"] = percentileMap
Member

Let's make it percentile, as then it reads percentile.99: 0.2

Contributor Author

done
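The outcome of the two threads above (no p prefix, singular percentile) might look roughly like the sketch below. The `quantile` struct is a hypothetical stand-in for the client model's quantile type, and `percentiles` is an illustrative helper, not the function from this PR.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// quantile is a hypothetical stand-in for one summary quantile.
type quantile struct {
	q     float64 // quantile, e.g. 0.99
	value float64 // observed value; NaN when there are no observations
}

// percentiles maps each quantile to a key such as "99" and skips NaN values.
func percentiles(qs []quantile) map[string]float64 {
	out := map[string]float64{}
	for _, q := range qs {
		if math.IsNaN(q.value) {
			continue // NaN values are left out of the event
		}
		// Shortest decimal representation of 100*q, with no "p" prefix.
		key := strconv.FormatFloat(100*q.q, 'f', -1, 64)
		out[key] = q.value
	}
	return out
}

func main() {
	value := map[string]interface{}{}
	p := percentiles([]quantile{
		{q: 0.5, value: 0.05},
		{q: 0.9, value: math.NaN()},
		{q: 0.99, value: 0.2},
	})
	if len(p) != 0 {
		value["percentile"] = p // reads as percentile.99: 0.2
	}
	fmt.Println(value)
}
```

Note that multiplying a quantile by 100 is a floating-point operation, so some quantiles may not produce a round key; the sketch makes no attempt to correct for that.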

bucketMap[key] = bucket.GetCumulativeCount()
}

value["buckets"] = bucketMap
Member

Same here, let's make it bucket.

Contributor Author

done
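A minimal sketch of how histogram buckets could be turned into the bucket map follows. The `bucket` struct is a hypothetical stand-in for the client model's bucket type, and the key formatting via `strconv.FormatFloat` is an assumption chosen to match the keys in the example event in the PR description (such as "1000000" and "+Inf").

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// bucket is a hypothetical stand-in for one histogram bucket.
type bucket struct {
	upperBound      float64 // the "le" label value
	cumulativeCount uint64  // observations less than or equal to upperBound
}

// bucketMap keys each cumulative count by its formatted upper bound.
func bucketMap(bs []bucket) map[string]uint64 {
	out := make(map[string]uint64, len(bs))
	for _, b := range bs {
		// 'f' with precision -1 prints 1e+06 as "1000000" and +Inf as "+Inf".
		key := strconv.FormatFloat(b.upperBound, 'f', -1, 64)
		out[key] = b.cumulativeCount
	}
	return out
}

func main() {
	value := map[string]interface{}{}
	value["bucket"] = bucketMap([]bucket{
		{upperBound: 125000, cumulativeCount: 566},
		{upperBound: 1e+06, cumulativeCount: 567},
		{upperBound: math.Inf(1), cumulativeCount: 567},
	})
	fmt.Println(value)
}
```

Since the counts are cumulative, the "+Inf" bucket always equals the total count of the histogram.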


import (
"fmt"
dto "github.com/prometheus/client_model/go"
Member

see import rules above

Contributor Author

done.

@@ -13,11 +13,13 @@
},
"prometheus": {
"collector": {
"label": {
"labels": {
Member

I would prefer to have label here, even though I'm aware this is not consistent with most of the other label fields we have. They should also be singular.

Contributor Author

done

@vjsamuel vjsamuel force-pushed the prometheus_refactor branch from 52b9a68 to f45e386 Compare April 21, 2017 08:34
@vjsamuel vjsamuel force-pushed the prometheus_refactor branch from f45e386 to a5ee39f Compare April 21, 2017 08:35
Member

@ruflin left a comment

LGTM. @tsg could you also have a look at this one as I remember we had quite a few discussions about the prometheus format ...

@@ -79,6 +80,7 @@ https://github.com/elastic/beats/compare/v5.1.1...master[Check the HEAD diff]
- Support common.Time in mapstriface.toTime() {pull}3812[3812]
- Fixing panic on prometheus collector when label has , {pull}3947[3947]
- Fix MongoDB dbstats fields mapping. {pull}4025[4025]
- Fixing prometheus collector to aggregate metrics based on metric family. {pull}4075[4075]
Member

This one should now be removed.

@tsg
Contributor

tsg commented Apr 21, 2017

@ruflin @vjsamuel The new document organization makes a lot of sense to me, and it's great we can group more metrics together. 👍 👍 👍

@ruflin ruflin merged commit aad3dbb into elastic:master Apr 21, 2017
@vjsamuel vjsamuel deleted the prometheus_refactor branch April 21, 2017 14:58