This repository has been archived by the owner on Oct 23, 2019. It is now read-only.

Fix #121: Add example prometheus setup #135

Merged · 10 commits · Oct 10, 2017

Conversation

@abesto commented Feb 26, 2017

Add Prometheus, Grafana to docker-compose.yml. Document basic usage and setup. See #121 for reasoning.

I attempted to "ship" with a state where Grafana is already configured, but it's not completely trivial (where do you store the initial state? How do you get it into Grafana?), and it adds complexity. It would also remove some of the learning we expect to provide - automating it here means users will need to re-discover it later.

@abesto commented Feb 26, 2017

On second thought, maybe this doesn't satisfy the original intention, i.e. to show what these metrics mean. A good way to solve that could be to release a Zipkin dashboard on https://grafana.net/dashboards.

@codefromthecrypt
Unless both are required in one step, it's probably good to merge this (at the least, it helps people understand and test the integration of metrics, regardless of how they are interpreted).

@abesto commented Feb 27, 2017

While this should work fine without a dashboard, I went ahead and drafted a dashboard that includes all the currently exported data. I tried to organize it to the best of my ability: https://grafana.net/dashboards/1598 (currently draft, meaning it's viewable via this link, but doesn't show up in searches. At least I think that's what it means)

[screenshot: draft dashboard]

I also created an OpenZipkin org on grafana.net so that we can share write access: https://grafana.net/openzipkin. Drop me your grafana.net username to get access.

So I'd say: let's do one round of iteration on the dashboard (either by review or direct edit). I'll go ahead and update the README accordingly; once done, we can release the dashboard and merge this.

@abesto abesto requested a review from kristofa March 1, 2017 19:37
@abesto commented Mar 1, 2017

@kristofa @klette what are your thoughts on this?

@kristofa commented Mar 1, 2017

@abesto Looks good! Could we pre-configure the grafana docker container with the Zipkin dashboard you made on grafana.net? I can have a look at the dashboard. My grafana.net username is kristofa.

@abesto commented Mar 1, 2017

@kristofa Initially I thought the only way to do that would be to commit the database file into the repo, but on second reading there may be a better way (dashboards.json config value). Checking into that now. Added you to the grafana.net org.

Update: that doesn't work for the dashboard format used by grafana.net, and it doesn't support adding data sources. Created a small shell script to set up the data source and the dashboard on startup (pulling the dashboard from grafana.net).
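The startup script itself isn't shown in this comment, but its data source step would amount to something like the sketch below. This is a guess at the shape, not the actual script from the PR; the container hostnames and the data source name "prom" are assumptions. Grafana's HTTP API accepts a data source definition as JSON:

```shell
# Hypothetical sketch of the data-source step of the startup script.
# Build the data source definition Grafana expects:
cat > datasource.json <<'EOF'
{
  "name": "prom",
  "type": "prometheus",
  "url": "http://prometheus:9090",
  "access": "proxy"
}
EOF

# Against the live containers this would be POSTed to Grafana's API:
#   curl -s -X POST -H 'Content-Type: application/json' \
#     --data-binary @datasource.json http://grafana:3000/api/datasources

# Sanity-check the payload locally:
python3 -m json.tool datasource.json > /dev/null && echo "payload OK"
```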

@codefromthecrypt commented Mar 2, 2017 via email

@abesto commented Mar 4, 2017

@adriancole added you on grafana.net to the org.

@kristofa commented Mar 5, 2017

I noticed we don't seem to expose any metrics in the dashboard that indicate failures. Don't we have metrics that expose, for example, HTTP 5xx responses? After I invoked some requests to zipkin-web, I noticed new metrics appeared in Prometheus indicating the number of requests per endpoint (e.g. status_200_api_v1_services). Do we have status_5xx_api_v2_services or similar? That would be useful.

Also, the status_200_... metrics are more interesting compared to the http_sessions_active metric IMHO because they are grouped by endpoint. The status_ metrics are counters so that means we'll have to use rate to get the req/sec.

If we want to make the dashboard production-ready we might also sum the metrics to show the total over all instances. For the Docker container case that doesn't really matter since we only have one instance running at port 9411, but in production we are likely to have more. For example, a graph showing the successful requests/sec to the api_v1_services endpoint across all service instances might use the following Prometheus expression:

sum(rate(status_200_api_v1_services{job="zipkin"}[1m]))

@codefromthecrypt
@kristofa I'm pretty sure we can make our api 500 :) (then answer the question about the metrics path). It is standard Spring Boot metrics. At any rate, we ought to think about testing this integration at some point. For example, we already have some Docker tests for storage. It would be neat to have a Testcontainers test to ensure whatever we say here actually works moving forward.

cc @shakuzen @bsideup

@bsideup commented Mar 6, 2017

@adriancole shouldn't be hard :) I'm a bit out of a context, but let me know if you need any help to configure TestContainers for it :)

@abesto commented Mar 6, 2017

@kristofa Totally agree on the status_.* metrics, not sure how I missed them. I went digging and learned that, for the response times at least, we can use a query like {__name__=~"^response_.*", job="zipkin"} - this will let the dashboard pick up new response time metrics as they're added. Unfortunately when aggregation is applied, it seems to merge the metrics - for instance rate({__name__=~"^status_200_.*", job="zipkin"}[1m]) is a single metric. Am I missing something? Do we really have to hard-code the endpoints into the dashboard? (Also played around with Grafana templating, but the end result is effectively the same).

Side-note on doing a sum on top of the rate: generally speaking, I prefer having individual metrics, and setting Grafana to stack the appropriate values. This provides the same value at the "top" of the stack, but more detail is immediately visible (say, if one node in the cluster is misbehaving). In this case however having both the "node" and the "endpoint" axis is not feasible, and endpoint is the more important one. Maybe this is a case where Grafana templating can help (set up a query variable for nodes?)

@kristofa commented Mar 9, 2017

@abesto We stepped away from metric names which embed dimensions that are interesting for filtering, like statusClass, and use labels instead. So the status_200_api_v1_services metric could instead look like status_api_total with labels version, statusClass, and path, and you could filter by specifying status_api_total{version="v1", statusClass="2xx", path="services"}. We learned from the Prometheus developers that this is a nicer way to define metrics. New endpoints would also automatically show up as values for the path label.
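In the Prometheus text exposition format, the difference looks roughly like this (sample values made up, metric names as proposed above):

```
# Before: one metric name per dimension combination
status_200_api_v1_services 42
status_200_api_v1_spans 17

# After: one metric, with the dimensions moved into labels
status_api_total{version="v1", statusClass="2xx", path="services"} 42
status_api_total{version="v1", statusClass="2xx", path="spans"} 17
status_api_total{version="v1", statusClass="5xx", path="services"} 3
```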

We don't typically do aggregation on the client/Grafana side; we often define Prometheus recording rules instead. These are pre-calculated on the server side so that expensive, frequently used queries aren't recalculated for every client request. Triggering expensive queries and shipping more detailed data to the client might work, depending on how many time series your Prometheus server has to maintain and serve. But indeed, the more you aggregate, the more likely it is you'll have to run a separate query to drill down, for example to find a single misbehaving instance.

@abesto commented Mar 9, 2017

@kristofa So the way Zipkin currently exposes metrics doesn't follow the established best practices of the Prometheus community. Thanks for educating!

Let me test my understanding. Even after we restructure the metrics as you described (for instance, to have one metric called http_response_status_count with labels version, path, and statusClass), we'll still hit the same problem on rate calculation, right? After the refactor, the query would be something like sum(rate(http_response_status_count)), which will still become a single metric (same as the old query using __name__). So if we want to both 1. keep the metric type a gauge and 2. not update the Grafana dashboard each time a new endpoint is added, we'd need to set up a recording rule on the Prometheus side, something like http_response_status_count_rate = sum(rate(http_response_status_count[1m])) by (host)? And then in Grafana the query will be just http_response_status_count_rate, showing one (per-minute rate) metric per HTTP endpoint (with new endpoints showing up automatically).

Is that correct? Is this, do you think, the idiomatic way to approach this?
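For reference, such a recording rule in a Prometheus 1.x rules file (the syntax current at the time of this thread) might look like the sketch below. The metric name is the hypothetical one from the discussion, and the grouping labels are illustrative:

```
# zipkin.rules -- evaluated on the Prometheus server at scheduled intervals,
# so dashboards only query the cheap, pre-aggregated series
http_response_status_count_rate = sum(rate(http_response_status_count[1m])) by (path, statusClass)
```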

README.md Outdated
Zipkin comes with a built-in Prometheus metric exporter. The main
`docker-compose.yml` file starts Prometheus configured to scrape Zipkin, exposes
it on port `9090`. You can open `$DOCKER_HOST_IP:9090` and start exploring the
metrics (which are available on the `/promethes` endpoint of Zipkin).
Typo here?

Typo there. Fixing, thanks!

@kristofa

Sorry for the late response.

We can use an expression like sum(rate(http_response_status_count[1m])) by (path, statusClass). When using this you will get a new entry in your graph when a new path is added without having to update the graph definition. In my opinion this also is more readable compared to using expressions like {__name__=~"^response_.*"}.

The rate function calculates the per-second average rate; see the Prometheus documentation for rate().

The recording rules I talked about make sense when the expression is expensive to calculate; the rule moves the calculation to the server side, where it runs at scheduled intervals, avoiding recalculation for every client request.

Here are the Prometheus metric / label name conventions: Metric and label naming

@abesto commented Mar 19, 2017

I didn't put two and two together - thanks for explaining things to a Prometheus newbie! I've just understood how the whole aggregation / by clause business works. Totally agree that using labels is more readable. (I tried to get a reasonably meaningful HTTP status code graph given the current metrics, but by (name) and by (__name__) don't seem to work. Which is fine, we need to do the Right Thing anyway.)

I see two roads ahead at this point, looking mostly to @adriancole for advice:

  1. We can merge this PR as is and get the basic example out there, then restructure the Prometheus metrics whenever we get around to it (breaking the current dashboard) and update the dashboard on grafana.net. Pro: the example is out there ASAP. Con: dashboards will break, unless we double-publish metrics as both http_status{path="/api/v1/services"} and status_200_api_v1_services.
  2. We can put this PR on hold until we get around to restructuring the Prometheus metrics. Pro: the first released dashboard will be kickass, no compatibility problems. Con: delays releasing the example.

@kristofa

I would get the dashboard out as it is now. It is already useful to show the integration and existing metrics. We can always iterate later.

@codefromthecrypt
FYI, I noticed that since we added Prometheus to Zipkin, upstream formalized it. Maybe we should consider their metrics endpoint before formalizing this (or maybe we do it after): openzipkin/zipkin#1144 (comment)

@abesto commented Jun 11, 2017

Rebased on top of master.

Re: replacing metrics collection with the upstream client: that won't change the format of the metrics exposed; the official client does pretty much the same as our current exporter. It does add some new metrics, with which we can extend the dashboard later (see openzipkin/zipkin#1609).

We can also do some more magic on the Prometheus server side to get nicer response count metrics; I'll try to do that now (see prometheus/client_java#255 (comment)).
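The server-side rewriting mentioned here could look something like the metric_relabel_configs sketch below in prometheus.yml, splitting the scraped status_<code>_<endpoint> names into one metric with labels at scrape time. The regexes and the target metric name are illustrative guesses, not the actual rules from the PR:

```
scrape_configs:
  - job_name: zipkin
    metrics_path: /prometheus
    static_configs:
      - targets: ['zipkin:9411']
    metric_relabel_configs:
      # Extract the status code from names like status_200_api_v1_services
      - source_labels: ['__name__']
        regex: 'status_(\d+)_(.*)'
        target_label: 'status'
        replacement: '${1}'
      # Extract the endpoint path from the same names
      - source_labels: ['__name__']
        regex: 'status_(\d+)_(.*)'
        target_label: 'path'
        replacement: '${2}'
      # Collapse all of them onto a single metric name
      - source_labels: ['__name__']
        regex: 'status_\d+_.*'
        target_label: '__name__'
        replacement: 'http_requests'
```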

@abesto commented Jun 11, 2017

With the relabeling rules the dashboard can now be significantly smarter, with automatically populated response code count and response time graphs. Updated the dashboard on grafana.net, looks something like this:

[screenshot: updated Zipkin / Prometheus dashboard]

@kristofa @adriancole Some time has passed, and there are new changes. I think this is ready to merge, waiting for a nod from you :)

http://grafana:3000/api/dashboards/import
echo '{"dashboard": ' > data.json
curl -s https://grafana.com/api/dashboards/${dashboard_id}/revisions/${last_revision}/download >> data.json
echo ', "inputs": [{"name": "DS_PROM", "pluginId": "prometheus", "type": "datasource", "value": "prom"}], "overwrite": false}' >> data.json
This rewrite was needed because as the dashboard JSON grows (more graphs added), we ran into this issue:

xargs: argument line too long
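The idea behind the rewrite above can be sketched offline, with a local stub standing in for the grafana.com download: stream the payload pieces into a file and POST the file, so the dashboard JSON never has to pass through a shell argument (which is where xargs hit its length limit). The POST step is commented out since it needs the live containers:

```shell
# Build the import payload in a file instead of a shell argument.
echo '{"dashboard": ' > data.json
# In the real script this line is the curl download from grafana.com:
echo '{"title": "Zipkin / Prometheus (stub)"}' >> data.json
echo ', "inputs": [{"name": "DS_PROM", "pluginId": "prometheus", "type": "datasource", "value": "prom"}], "overwrite": false}' >> data.json

# The real script then POSTs the file to Grafana:
#   curl -s -X POST -H 'Content-Type: application/json' \
#     --data-binary @data.json http://grafana:3000/api/dashboards/import

# Verify the assembled payload is valid JSON:
python3 -m json.tool data.json > /dev/null && echo "payload OK"
```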

@abesto commented Jun 12, 2017

Added message count (received and dropped), spans received count, and bytes received graphs to the dashboard. They automatically pick up transports (see the new relabeling rules).

[screenshot: dashboard with message count, spans received, and bytes received graphs]

@abesto commented Aug 16, 2017

After updating the rewrite rules for the changes in openzipkin/zipkin#1609 (see https://gist.github.com/abesto/642cd049cc75643213b6e4c23bad7734), here's the current state. Things to note:

  • auto-populated template variable with the Zipkin instances
  • added the Zipkin instance to the label of all metrics, except for the response count which sums over all instances (otherwise we get waaay too many metrics). Even though the graph is not exploded by instance, the filtering is still applied
  • hid the labels from graphs with over two lines of labels
  • made all the labels human-friendly (no more metric{instance=...})
  • added 90th percentile response time graph

[screenshot: dashboard with instance template variable and 90th percentile response time graph]

@codefromthecrypt
very nice.. going out for zipkin 2.2

@abesto commented Oct 10, 2017

Cool!

⚠️ Watch out: the version of the dashboard currently on grafana.com works with the pre-Spring metrics. Once 2.2 is released, https://grafana.com/dashboards/1598/ needs to be updated with the JSON in https://gist.github.com/abesto/642cd049cc75643213b6e4c23bad7734.

@codefromthecrypt commented Oct 10, 2017 via email
