Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store level latency metric #7149

Closed

Conversation

sandeepsukhani
Copy link
Contributor

@sandeepsukhani sandeepsukhani commented Sep 14, 2022

What this PR does / why we need it:
In Reads/Writes dashboard, we have panels for plotting index QPS and latency, but they are specific to boltdb-shipper and bigtable. Each index store exposes its metric, but we can't have all the supported store types in those dashboards.
We have added experimental support for tsdb which is going to be the one we will recommend when it becomes production ready so it is going to be yet another store which a lot of people would be interested in monitoring.

To avoid having latency and qps panel for each index type, I added loki_index_request_duration_seconds metric in PR #6880 but it was only measuring index latencies. In this PR I am renaming it to loki_store_request_duration_seconds to measure overall store request latencies including PUT chunks call.

I have added this metric in a separate panel to Reads / Writes dashboards. We should drop store-specific panels in the next major release. Renaming this metric should not be a problem since we have not done a release since I added it.

@sandeepsukhani sandeepsukhani requested a review from a team as a code owner September 14, 2022 06:58
@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Copy link
Collaborator

@periklis periklis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides missing make loki-mixin all clear and safe to me.

@pull-request-size pull-request-size bot added size/L and removed size/M labels Sep 14, 2022
@grafanabot
Copy link
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Copy link
Contributor

@dannykopping dannykopping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but if this metric is just for index stores then I'd suggest we change the name to not confuse users (like myself initially) that may think this metric measures all store requests, including chunk stores.

pkg/storage/stores/composite_store.go Show resolved Hide resolved
Namespace: "loki",
Name: "index_request_duration_seconds",
Help: "Time (in seconds) spent in serving index query requests",
Name: "store_request_duration_seconds",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes from #6880 never made it into a release, correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this metric only for index stores?

Copy link
Contributor Author

@sandeepsukhani sandeepsukhani Sep 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes from #6880 never made it into a release, correct?

yes, latest release 2.6.1 doesn't have it.

Is this metric only for index stores?

All the read requests here are purely index queries, while the write request i.e Put and PutOne, uploads the chunk and indexes it as well. It would be hard to purely track just the index write request without refactoring the code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think having a metric named loki_store_request_duration_seconds that doesn't include chunk store reads but only includes writes will be quite confusing.

How hard do you anticipate the refactoring would be? I want to try optimise for clarity as much as possible given that our system is already very complex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at the code, and it seems quite complex and would take time. The other option would be to keep the old metric loki_index_request_duration_seconds which tracks just the index requests for now, and update just the reads dashboards until we refactor the code. I will keep the refactoring work on my to-do list until then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other option would be to keep the old metric loki_index_request_duration_seconds which tracks just the index requests for now, and update just the reads dashboards until we refactor the code. I will keep the refactoring work on my to-do list until then.

Thanks Sandeep, I think that might work out better for our users - if you don't mind?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dannykopping I gave it a try. Here is the PR #7154
Please let me know what do you think if you get a chance to have a look.

@sandeepsukhani
Copy link
Contributor Author

LGTM, but if this metric is just for index stores then I'd suggest we change the name to not confuse users (like myself initially) that may think this metric measures all store requests, including chunk stores.

I can't think of a better name. We will have to refactor the code to measure just the index latency on the chunk flush operations, which would let us rename the metric to loki_index_request_duration_seconds latencies. I think it might be tricky to do it, so I can give it a try in a separate PR.

@sandeepsukhani
Copy link
Contributor Author

Closing this in favour of #7163

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants