
Add per-tenant cache metrics #6289

Conversation

@dannykopping (Contributor) commented Jun 2, 2022

What this PR does / why we need it:
To improve and troubleshoot our cache performance, we need per-tenant metrics. Right now we track cache value size, hit/miss ratio, request latency, and requested key count only at a global level. All of these metrics except request latency are too coarse to be very useful. Request latency is irrelevant per tenant because we have no way to dedicate a caching backend to a single tenant, so it would be a very noisy per-tenant metric.

Which issue(s) this PR fixes:
#6318

Special notes for your reviewer:
This setting defaults to false due to the potential for cardinality explosions.

This PR adds a tenant label for the following metrics:

  • loki_cache_value_size_bytes
  • loki_cache_fetched_keys
  • loki_cache_hits

If one has a large number of active tenants, this can result in the creation of many more metric series.
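The opt-in labeling described above can be sketched in Go (Loki's language). The function, flag, and label names below are illustrative assumptions, not the PR's actual code:

```go
package main

import "fmt"

// cacheMetricLabels sketches how a tenant label might be attached to
// cache metrics only when a per_tenant_metrics-style flag is enabled.
// Names here are hypothetical, not Loki's real code.
func cacheMetricLabels(perTenantMetrics bool, tenant string) map[string]string {
	// The cache name label always applies; "chunksmemcache" is an example value.
	labels := map[string]string{"cache": "chunksmemcache"}
	if perTenantMetrics && tenant != "" {
		// Only add the tenant dimension when explicitly opted in, since
		// each distinct tenant value creates an additional metric series.
		labels["tenant"] = tenant
	}
	return labels
}

func main() {
	fmt.Println(cacheMetricLabels(true, "tenant-a"))
	fmt.Println(cacheMetricLabels(false, "tenant-a"))
}
```

This keeps the cardinality cost strictly opt-in: with the flag off, the label set is identical to today's global metrics.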

Checklist

  • Documentation added
  • Tests updated
  • Is this an important fix or new feature? Add an entry in the CHANGELOG.md.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

@trevorwhitney (Collaborator) left a comment


I like where this is headed :)

@@ -1833,6 +1833,10 @@ the index to a backing cache store.
# CLI flag: -<prefix>.cache.enable-fifocache
[enable_fifocache: <boolean>]

# Add tenant labels to cache-related metrics.
# CLI flag: -<prefix>.cache.per-tenant-metrics
[per_tenant_metrics: <boolean> | default = false]
Collaborator:

WDYT about just always doing this based on whether tenancy is enabled (i.e. auth = true)? Curious how much cardinality explosion we're really protecting users from here, and we already have sooooo many configs.

@dannykopping (Contributor Author):
You know it's rude to review draft PRs 😛 but thanks for taking a look

Cardinality is something I plan to address in the PR description, but I'm finishing something else up first.

Collaborator:

haha, it's like we're pairing...but asynchronously! yeah, I didn't take too critical an eye, but couldn't resist getting my opinions in early.

@dannykopping mentioned this pull request Jun 6, 2022
Signed-off-by: Danny Kopping <[email protected]>
@dannykopping force-pushed the dannykopping/per-tenant-cache-metrics branch from 0a7b9a3 to d064453 on June 6, 2022 14:01
@grafanabot (Collaborator) commented:
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0.3%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

Signed-off-by: Danny Kopping <[email protected]>
@dannykopping dannykopping marked this pull request as ready for review June 6, 2022 14:19
@dannykopping dannykopping requested review from KMiller-Grafana and a team as code owners June 6, 2022 14:19
@owen-d (Member) left a comment

I'm not in favor of adding this due to cardinality concerns on large clusters. It will be helpful on smaller clusters, but at the expense of being a toggle away from a cardinality bomb on large ones :(. Instead, can we include tenant information in logs for later extraction via loki?

@dannykopping (Contributor Author):

> I'm not in favor of adding this due to cardinality concerns on large clusters. It will be helpful on smaller clusters, but at the expense of being a toggle away from a cardinality bomb on large ones :(. Instead, can we include tenant information in logs for later extraction via loki?

I'm fine with parking this for now 👍
#6317 gives us most of what we need for now, except for cache value sizes, but we can factor that into the logs somehow if we need to.
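The log-based alternative suggested in review could look roughly like this in Go; the logfmt field names are assumptions for illustration, not Loki's actual logging schema:

```go
package main

import "fmt"

// logCacheFetch sketches emitting tenant context as logfmt fields so that
// per-tenant cache behaviour (including value sizes) could be extracted
// later with LogQL, instead of paying the metric-cardinality cost up front.
// All field names are illustrative.
func logCacheFetch(tenant, cache string, requested, found int) string {
	return fmt.Sprintf(
		"msg=%q tenant=%s cache=%s keys_requested=%d keys_found=%d",
		"cache fetch", tenant, cache, requested, found)
}

func main() {
	fmt.Println(logCacheFetch("tenant-a", "chunksmemcache", 100, 87))
}
```

Because log streams are already labeled per component rather than per tenant, this moves the tenant dimension out of Prometheus series and into queryable log content.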

@osg-grafana osg-grafana added type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories and removed area/docs labels Oct 19, 2022
Labels
size/M, type/docs (Issues related to technical documentation; the Docs Squad uses this label across many repositories)
6 participants