This repository has been archived by the owner on Oct 23, 2019. It is now read-only.

Fix #121: Add example prometheus setup #135

Merged · 10 commits · Oct 10, 2017

Conversation

@abesto commented Feb 26, 2017

Add Prometheus, Grafana to docker-compose.yml. Document basic usage and setup. See #121 for reasoning.

I attempted to "ship" with a state where Grafana is already configured, but it's not completely trivial (where do you store the initial state? How do you get it into Grafana?), and it adds complexity. It would also remove some of the learning we expect to provide - automating it here means users will need to re-discover it later.

@abesto commented Feb 26, 2017

On second thought, maybe this doesn't satisfy the original intention, i.e. to show what these metrics mean. A good way to solve that could be to release a Zipkin dashboard on https://grafana.net/dashboards.

@codefromthecrypt
Unless both are required in one step, it's probably good to merge this (at the least, it helps people understand and test the integration of metrics, regardless of how they are interpreted).

@abesto commented Feb 27, 2017

While this should work fine without a dashboard, I went ahead and drafted a dashboard that includes all the currently exported data. I tried to organize it to the best of my ability: https://grafana.net/dashboards/1598 (currently draft, meaning it's viewable via this link, but doesn't show up in searches. At least I think that's what it means)

[screenshot: draft dashboard]

I also created an OpenZipkin org on grafana.net so that we can share write access: https://grafana.net/openzipkin. Drop me your grafana.net username to get access.

So I'd say: let's do one round of iteration on the dashboard (either by review or direct edit). I'll go ahead and update the README accordingly; once done, we can release the dashboard and merge this.

@abesto abesto requested a review from kristofa March 1, 2017 19:37
@abesto commented Mar 1, 2017

@kristofa @klette what are your thoughts on this?

@kristofa commented Mar 1, 2017

@abesto Looks good! Could we pre-configure the grafana docker container with the Zipkin dashboard you made on grafana.net? I can have a look at the dashboard. My grafana.net username is kristofa.

@abesto commented Mar 1, 2017

@kristofa Initially I thought the only way to do that would be to commit the database file into the repo, but on second reading there may be a better way (dashboards.json config value). Checking into that now. Added you to the grafana.net org.

Update: that doesn't work for the dashboard format used by grafana.net, and it doesn't support adding data sources. Created a small shell script to set up the data source and the dashboard on startup (pulling the dashboard from grafana.net).
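The startup script itself isn't shown in this comment, but its data source step would amount to something like the sketch below. This is a guess at the shape, not the actual script from the PR; the container hostnames and the data source name "prom" are assumptions. Grafana's HTTP API accepts a data source definition as JSON:

```shell
# Hypothetical sketch of the data-source step of the startup script.
# Build the data source definition Grafana expects:
cat > datasource.json <<'EOF'
{
  "name": "prom",
  "type": "prometheus",
  "url": "http://prometheus:9090",
  "access": "proxy"
}
EOF

# Against the live containers this would be POSTed to Grafana's API:
#   curl -s -X POST -H 'Content-Type: application/json' \
#     --data-binary @datasource.json http://grafana:3000/api/datasources

# Sanity-check the payload locally:
python3 -m json.tool datasource.json > /dev/null && echo "payload OK"
```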

@codefromthecrypt commented Mar 2, 2017 via email

@abesto commented Mar 4, 2017

@adriancole added you on grafana.net to the org.

@kristofa commented Mar 5, 2017

I noticed we don't seem to expose any metrics in the dashboard that indicate failures. Don't we have metrics that expose, for example, HTTP 5xx responses? After I invoked some requests to zipkin-web, I noticed new metrics appeared in Prometheus indicating the number of requests per endpoint (e.g. status_200_api_v1_services). Do we have status_5xx_api_v2_services or similar? That would be useful.

Also, the status_200_... metrics are more interesting compared to the http_sessions_active metric IMHO because they are grouped by endpoint. The status_ metrics are counters so that means we'll have to use rate to get the req/sec.

If we want to make the dashboard production-ready we might also sum the metrics to show the total over all instances. For the Docker container case that doesn't really matter since we only have one instance running at port 9411, but in production we are likely to have more. For example, a graph showing the successful requests/sec to the api_v1_services endpoint across all service instances might use the following Prometheus expression:

sum(rate(status_200_api_v1_services{job="zipkin"}[1m]))

@codefromthecrypt
@kristofa I'm pretty sure we can make our api 500 :) (then answer the question about the metrics path). It is standard Spring Boot metrics. At any rate, we ought to think about testing this integration at some point. For example, we already have some Docker tests for storage. It would be neat to have a Testcontainers test to ensure whatever we say here actually works moving forward.

cc @shakuzen @bsideup

@bsideup commented Mar 6, 2017

@adriancole shouldn't be hard :) I'm a bit out of a context, but let me know if you need any help to configure TestContainers for it :)

@abesto commented Mar 6, 2017

@kristofa Totally agree on the status_.* metrics, not sure how I missed them. I went digging and learned that, for the response times at least, we can use a query like {__name__=~"^response_.*", job="zipkin"} - this will let the dashboard pick up new response time metrics as they're added. Unfortunately when aggregation is applied, it seems to merge the metrics - for instance rate({__name__=~"^status_200_.*", job="zipkin"}[1m]) is a single metric. Am I missing something? Do we really have to hard-code the endpoints into the dashboard? (Also played around with Grafana templating, but the end result is effectively the same).

Side-note on doing a sum on top of the rate: generally speaking, I prefer having individual metrics, and setting Grafana to stack the appropriate values. This provides the same value at the "top" of the stack, but more detail is immediately visible (say, if one node in the cluster is misbehaving). In this case however having both the "node" and the "endpoint" axis is not feasible, and endpoint is the more important one. Maybe this is a case where Grafana templating can help (set up a query variable for nodes?)

@kristofa commented Mar 9, 2017

@abesto We stepped away from metric names which embed dimensions that are interesting for filtering, like statusClass, and use labels instead. So the status_200_api_v1_services metric could instead look like status_api_total with labels version, statusClass, and path, and you could filter by specifying status_api_total{version="v1", statusClass="2xx", path="services"}. We learned from the Prometheus developers that this is a nicer way to define metrics. New endpoints would also automatically show up as values for the path label.
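In the Prometheus text exposition format, the difference looks roughly like this (sample values made up, metric names as proposed above):

```
# Before: one metric name per dimension combination
status_200_api_v1_services 42
status_200_api_v1_spans 17

# After: one metric, with the dimensions moved into labels
status_api_total{version="v1", statusClass="2xx", path="services"} 42
status_api_total{version="v1", statusClass="2xx", path="spans"} 17
status_api_total{version="v1", statusClass="5xx", path="services"} 3
```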

We don't typically do aggregation on the client/Grafana side; we often define Prometheus recording rules instead. These are pre-calculated on the server side so that expensive, frequently used queries aren't recalculated for every client request. Triggering expensive queries and shipping more detailed data to the client might work, depending on how many time series your Prometheus server has to maintain and serve. But indeed, the more you aggregate, the more likely it is you'll have to run a separate query to drill down, for example to find a single misbehaving instance.

@abesto commented Mar 9, 2017

@kristofa So the way Zipkin currently exposes metrics doesn't follow the established best practices of the Prometheus community. Thanks for educating!

Let me test my understanding. Even after we restructure the metrics as you described (for instance, to have one metric called http_response_status_count with labels version, path, and statusClass), we'll still hit the same problem on rate calculation, right? After the refactor, the query would be something like sum(rate(http_response_status_count)), which will still become a single metric (same as the old query using __name__). So if we want to both 1. keep the metric type a gauge and 2. not update the Grafana dashboard each time a new endpoint is added, we'd need to set up a recording rule on the Prometheus side, something like http_response_status_count_rate = sum(rate(http_response_status_count[1m])) by (host)? And then in Grafana the query will be just http_response_status_count_rate, showing one (per-minute rate) metric per HTTP endpoint (with new endpoints showing up automatically).

Is that correct? Is this, do you think, the idiomatic way to approach this?
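For reference, such a recording rule in a Prometheus 1.x rules file (the syntax current at the time of this thread) might look like the sketch below. The metric name is the hypothetical one from the discussion, and the grouping labels are illustrative:

```
# zipkin.rules -- evaluated on the Prometheus server at scheduled intervals,
# so dashboards only query the cheap, pre-aggregated series
http_response_status_count_rate = sum(rate(http_response_status_count[1m])) by (path, statusClass)
```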

README.md Outdated
Zipkin comes with a built-in Prometheus metric exporter. The main
`docker-compose.yml` file starts Prometheus configured to scrape Zipkin, exposes
it on port `9090`. You can open `$DOCKER_HOST_IP:9090` and start exploring the
metrics (which are available on the `/promethes` endpoint of Zipkin).
Typo here?

Typo there. Fixing, thanks!

@kristofa

Sorry for the late response.

We can use an expression like sum(rate(http_response_status_count[1m])) by (path, statusClass). When using this you will get a new entry in your graph when a new path is added without having to update the graph definition. In my opinion this also is more readable compared to using expressions like {__name__=~"^response_.*"}.

The rate function calculates the per-second average rate; see the Prometheus documentation for rate().

The recording rules I talked about make sense when the expression is expensive to calculate; the rule moves the calculation to the server side, where it runs at scheduled intervals, avoiding recalculation for every client request.

Here are the Prometheus metric / label name conventions: Metric and label naming

@abesto commented Mar 19, 2017

I didn't put two and two together - thanks for explaining things to a Prometheus newbie! I've just understood how the whole aggregation / by clause business works. Totally agree that using labels is more readable. (I tried to get a reasonably meaningful HTTP status code graph given the current metrics, but by (name) and by (__name__) don't seem to work. Which is fine, we need to do the Right Thing anyway.)

I see two roads ahead at this point, looking mostly to @adriancole for advice:

  1. We can merge this PR as is and get the basic example out there, then restructure the Prometheus metrics whenever we get around to it (breaking the current dashboard) and update the dashboard on grafana.net. Pro: the example is out there ASAP. Con: dashboards will break, unless we double-publish metrics as both http_status{path="/api/v1/services"} and status_200_api_v1_services.
  2. We can put this PR on hold until we get around to restructuring the Prometheus metrics. Pro: the first released dashboard will be kickass, no compatibility problems. Con: delays releasing the example.

@kristofa

I would get the dashboard out as it is now. It is already useful to show the integration and existing metrics. We can always iterate later.

@codefromthecrypt
FYI, I noticed that since we added Prometheus to Zipkin, upstream formalized it. Maybe we should consider their metrics endpoint before formalizing this (or maybe we do it after): openzipkin/zipkin#1144 (comment)

@abesto commented Jun 11, 2017

Rebased on top of master.

Re: replacing metrics collection with the upstream client: that won't change the format of the metrics exposed; the official client does pretty much the same as our current exporter. It does add some new metrics, with which we can extend the dashboard later (see openzipkin/zipkin#1609).

We can also do some more magic on the Prometheus server side to get nicer response count metrics; I'll try to do that now (see prometheus/client_java#255 (comment)).
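The server-side rewriting mentioned here could look something like the metric_relabel_configs sketch below in prometheus.yml, splitting the scraped status_<code>_<endpoint> names into one metric with labels at scrape time. The regexes and the target metric name are illustrative guesses, not the actual rules from the PR:

```
scrape_configs:
  - job_name: zipkin
    metrics_path: /prometheus
    static_configs:
      - targets: ['zipkin:9411']
    metric_relabel_configs:
      # Extract the status code from names like status_200_api_v1_services
      - source_labels: ['__name__']
        regex: 'status_(\d+)_(.*)'
        target_label: 'status'
        replacement: '${1}'
      # Extract the endpoint path from the same names
      - source_labels: ['__name__']
        regex: 'status_(\d+)_(.*)'
        target_label: 'path'
        replacement: '${2}'
      # Collapse all of them onto a single metric name
      - source_labels: ['__name__']
        regex: 'status_\d+_.*'
        target_label: '__name__'
        replacement: 'http_requests'
```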

@abesto commented Jun 11, 2017

With the relabeling rules the dashboard can now be significantly smarter, with automatically populated response code count and response time graphs. Updated the dashboard on grafana.net, looks something like this:

[screenshot: updated Zipkin / Prometheus dashboard]

@kristofa @adriancole Some time has passed, and there are new changes. I think this is ready to merge, waiting for a nod from you :)

http://grafana:3000/api/dashboards/import
echo '{"dashboard": ' > data.json
curl -s https://grafana.com/api/dashboards/${dashboard_id}/revisions/${last_revision}/download >> data.json
echo ', "inputs": [{"name": "DS_PROM", "pluginId": "prometheus", "type": "datasource", "value": "prom"}], "overwrite": false}' >> data.json
This rewrite was needed because as the dashboard JSON grows (more graphs added), we ran into this issue:

xargs: argument line too long
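The idea behind the rewrite above can be sketched offline, with a local stub standing in for the grafana.com download: stream the payload pieces into a file and POST the file, so the dashboard JSON never has to pass through a shell argument (which is where xargs hit its length limit). The POST step is commented out since it needs the live containers:

```shell
# Build the import payload in a file instead of a shell argument.
echo '{"dashboard": ' > data.json
# In the real script this line is the curl download from grafana.com:
echo '{"title": "Zipkin / Prometheus (stub)"}' >> data.json
echo ', "inputs": [{"name": "DS_PROM", "pluginId": "prometheus", "type": "datasource", "value": "prom"}], "overwrite": false}' >> data.json

# The real script then POSTs the file to Grafana:
#   curl -s -X POST -H 'Content-Type: application/json' \
#     --data-binary @data.json http://grafana:3000/api/dashboards/import

# Verify the assembled payload is valid JSON:
python3 -m json.tool data.json > /dev/null && echo "payload OK"
```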

@abesto commented Jun 12, 2017

Added message count (received and dropped), spans received count, and bytes received graphs to the dashboard. They automatically pick up transports (see the new relabeling rules).

[screenshot: dashboard with message count, spans received, and bytes received graphs]

@abesto commented Aug 16, 2017

After updating the rewrite rules for the changes in openzipkin/zipkin#1609 (see https://gist.github.com/abesto/642cd049cc75643213b6e4c23bad7734), here's the current state. Things to note:

  • auto-populated template variable with the Zipkin instances
  • added the Zipkin instance to the label of all metrics, except for the response count which sums over all instances (otherwise we get waaay too many metrics). Even though the graph is not exploded by instance, the filtering is still applied
  • hid the labels from graphs with over two lines of labels
  • made all the labels human-friendly (no more metric{instance=...})
  • added 90th percentile response time graph

[screenshot: dashboard with instance template variable and 90th percentile response time graph]

@codefromthecrypt
very nice.. going out for zipkin 2.2

@abesto commented Oct 10, 2017

Cool!

⚠️ Watch out: the version of the dashboard currently on grafana.com works with the pre-Spring metrics. Once 2.2 is released, https://grafana.com/dashboards/1598/ needs to be updated with the JSON in https://gist.github.com/abesto/642cd049cc75643213b6e4c23bad7734.

@codefromthecrypt commented Oct 10, 2017 via email
